
Monitoring and Logging

Business Scenario

Cloud Cafe's system is up and running, but we need to monitor its runtime state in real time so we can locate and resolve problems quickly. To make the system observable, we will deploy monitoring and logging infrastructure.

Requirements:

  • Deploy the Prometheus monitoring system
  • Deploy Grafana dashboards
  • Collect application logs
  • Configure alerting rules

Learning Objectives

After completing this lesson, you will be able to:

  • Deploy and configure Prometheus
  • Use Grafana and build dashboards
  • Collect and analyze logs
  • Configure alerting rules
  • Implement system observability

Prerequisites

1. Verify the environment

bash
# Check the namespace
kubectl get namespace cloud-cafe

# Check existing resources
kubectl get all -n cloud-cafe

# Check metrics-server
kubectl get pods -n kube-system | grep metrics

2. Create the monitoring namespace

bash
# Create the monitoring namespace
kubectl create namespace monitoring

# List namespaces
kubectl get namespaces

Hands-on Steps

Step 1: Deploy Prometheus

Concept: Prometheus is an open-source monitoring and alerting toolkit that collects and stores time-series data.
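Before configuring Prometheus, it helps to picture its data model. The following Python sketch (an illustration only, not Prometheus internals) shows the core idea: each time series is identified by a metric name plus a set of labels, and stores timestamped samples appended at every scrape.

```python
import time

# Illustrative sketch of Prometheus's data model: a time series is
# (metric name + label set) -> append-only list of (timestamp, value) samples.
class TimeSeries:
    def __init__(self, name, labels):
        self.name = name
        self.labels = dict(labels)
        self.samples = []  # list of (unix_timestamp, float_value)

    def append(self, value, timestamp=None):
        self.samples.append((timestamp or time.time(), value))

    def identity(self):
        # Series identity is the name plus sorted label pairs, e.g.
        # order_requests_total{method="GET",status="200"}
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(self.labels.items()))
        return f"{self.name}{{{label_str}}}"

series = TimeSeries("order_requests_total", {"method": "GET", "status": "200"})
series.append(1.0, timestamp=1700000000)
series.append(2.0, timestamp=1700000015)  # next scrape, 15s later (scrape_interval)
print(series.identity())  # order_requests_total{method="GET",status="200"}
```

A new label value (say `status="500"`) creates a brand-new series, which is why high-cardinality labels are expensive.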

1.1 Create the Prometheus ConfigMap

bash
# Create the Prometheus configuration
# Note: the literal is wrapped in single quotes, so the YAML inside must use
# double quotes (nested single quotes would terminate the shell string early)
kubectl create configmap prometheus-config \
  --from-literal=prometheus.yml='
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.*):10250"
        replacement: "${1}:9100"
        target_label: __address__
' \
  -n monitoring

# View the ConfigMap
kubectl get configmap prometheus-config -n monitoring
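The trickiest relabel rule above rewrites the scrape address: Prometheus joins `__address__` and the `prometheus.io/port` annotation with `;` (the default separator), then the regex swaps in the annotated port. A small Python sketch (an approximation, not Prometheus code) of that rewrite:

```python
import re

# Same regex as the relabel rule: host, optional existing :port, then the
# annotation port after the ';' separator.
pattern = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def relabel_address(address, annotation_port):
    joined = f"{address};{annotation_port}"  # Prometheus joins source_labels with ';'
    m = pattern.fullmatch(joined)
    return f"{m.group(1)}:{m.group(2)}" if m else address

print(relabel_address("10.244.1.7", "5000"))       # 10.244.1.7:5000
print(relabel_address("10.244.1.7:8080", "5000"))  # the annotated port wins
```

Both calls yield `10.244.1.7:5000`, so the annotation always decides which port gets scraped.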

1.2 Deploy Prometheus

bash
# Create the Prometheus Deployment YAML file
cat > prometheus-deployment.yaml << 'EOF'
# Prometheus monitoring Deployment
# Purpose: deploy the Prometheus monitoring server
# Features:
#   - collect and store time-series metrics
#   - serve the PromQL query API
#   - evaluate alerting rules
# Configuration notes:
#   - config.file: path to the main configuration file
#   - storage.tsdb.path: time-series data storage path
#   - web.enable-lifecycle: enable config hot reload (via /-/reload)
# Prerequisite: the prometheus-config ConfigMap must already exist
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1                   # single-node Prometheus (HA requires Thanos or Cortex)
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:latest    # official Prometheus image
        args:
          # Prometheus startup flags
          - '--config.file=/etc/prometheus/prometheus.yml'           # main config file
          - '--storage.tsdb.path=/prometheus'                        # TSDB storage path
          - '--web.console.libraries=/usr/share/prometheus/console_libraries'    # console libraries
          - '--web.console.templates=/usr/share/prometheus/consoles'             # console templates
          - '--web.enable-lifecycle'                                 # enable lifecycle API (hot reload)
        ports:
        - containerPort: 9090
          name: web                 # web UI and API port
        # volume mounts
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus    # configuration directory
        - name: prometheus-data
          mountPath: /prometheus        # data directory
        # resource limits (Prometheus needs plenty of memory)
        resources:
          requests:
            memory: "512Mi"       # minimum memory
            cpu: "250m"           # minimum CPU (0.25 cores)
          limits:
            memory: "1Gi"         # memory cap
            cpu: "500m"           # CPU cap (0.5 cores)
        # health checks
        livenessProbe:
          httpGet:
            path: /-/healthy      # Prometheus health endpoint
            port: 9090
          initialDelaySeconds: 30    # initial delay (Prometheus needs time to replay its WAL)
          periodSeconds: 10          # check interval
        readinessProbe:
          httpGet:
            path: /-/ready        # Prometheus readiness endpoint
            port: 9090
          initialDelaySeconds: 10    # initial delay
          periodSeconds: 5           # check interval
      # volumes
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config     # references the Prometheus config ConfigMap
      - name: prometheus-data
        emptyDir: {}                 # emptyDir (use a PVC in production)
EOF

# Apply the Deployment
kubectl apply -f prometheus-deployment.yaml

# Wait for Prometheus to become ready
kubectl rollout status deployment/prometheus -n monitoring

# View the Pods
kubectl get pods -n monitoring

1.3 Create the Prometheus Service

bash
# Create the Prometheus Service
kubectl expose deployment prometheus \
  --port=9090 \
  --target-port=9090 \
  --name=prometheus-svc \
  -n monitoring

# View the Service
kubectl get svc prometheus-svc -n monitoring

1.4 Access Prometheus

bash
# Create a NodePort Service
kubectl expose deployment prometheus \
  --port=9090 \
  --target-port=9090 \
  --name=prometheus-nodeport \
  --type=NodePort \
  -n monitoring

# Get the NodePort
PROMETHEUS_PORT=$(kubectl get svc prometheus-nodeport -n monitoring -o jsonpath='{.spec.ports[0].nodePort}')
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')

echo "Prometheus URL: http://$NODE_IP:$PROMETHEUS_PORT"

Open http://192.168.56.10:xxxxx in your browser.

You should see the Prometheus web UI.


Step 2: Instrument the Application with Prometheus Metrics

We need to add Prometheus metrics to the application so that Prometheus has data to collect.

2.1 Update the backend service to expose Prometheus metrics

First, create/edit the order-backend-deployment.yaml file:

bash
# Create/edit the backend Deployment file
vim order-backend-deployment.yaml

Contents of order-backend-deployment.yaml (with Prometheus metrics added):
yaml
# Order backend service Deployment
# This change: expose Prometheus metrics
# What changed:
#   1. metadata.annotations gains prometheus.io/scrape annotations for auto-discovery
#   2. pip install adds the prometheus-client dependency
#   3. the Python code gains metric definitions, decorators, and an exposition endpoint
# Prerequisites: Redis, MySQL, and the related ConfigMaps/Secrets must already exist
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-backend
  namespace: cloud-cafe
  labels:
    app: order-backend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: order-backend
  template:
    metadata:
      labels:
        app: order-backend
      # [added: start] Prometheus auto-discovery annotations
      # These annotations tell Prometheus how to scrape this application's metrics
      annotations:
        prometheus.io/scrape: "true"      # enable scraping
        prometheus.io/port: "5000"        # metrics port
        prometheus.io/path: "/metrics"    # metrics path
      # [added: end]
    spec:
      containers:
      - name: order-backend
        image: python:3.9-slim
        command: ["/bin/sh", "-c"]
        args:
          - |
            # [changed: start] add the prometheus-client dependency
            # prometheus-client is the official Prometheus client library for Python
            pip install flask pymysql flask-cors redis prometheus-client
            # [changed: end]
            cat > /app/app.py << 'PYEOF'
            from flask import Flask, request, jsonify
            from flask_cors import CORS
            import pymysql
            import redis
            import os
            import json
            # [added: start] import the Prometheus client library
            from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
            # Counter: monotonically increasing counter, good for request totals
            # Histogram: distribution of observations, good for request latency
            # [added: end]
            from datetime import timedelta

            app = Flask(__name__)
            CORS(app)

            # [added: start] define Prometheus metrics
            # order_requests_total: total HTTP requests, labeled by method, endpoint, and status
            order_requests_total = Counter('order_requests_total', 'Total number of order requests', ['method', 'endpoint', 'status'])
            # order_duration_seconds: distribution of request handling time
            order_duration_seconds = Histogram('order_request_duration_seconds', 'Order request duration')
            # db_query_duration_seconds: distribution of database query time
            db_query_duration_seconds = Histogram('db_query_duration_seconds', 'Database query duration')
            # cache_operations_total: cache operation counts (hit/miss)
            cache_operations_total = Counter('cache_operations_total', 'Total number of cache operations', ['operation', 'status'])
            # [added: end]

            # database configuration
            db_config = {
                'host': os.getenv('DB_HOST', 'mysql-service'),
                'port': int(os.getenv('DB_PORT', 3306)),
                'user': os.getenv('DB_USER', 'cafeadmin'),
                'password': os.getenv('DB_PASSWORD', 'userpassword123'),
                'database': os.getenv('DB_NAME', 'cloudcafe')
            }

            # Redis configuration
            redis_client = redis.Redis(
                host=os.getenv('REDIS_HOST', 'redis-svc'),
                port=int(os.getenv('REDIS_PORT', 6379)),
                decode_responses=True
            )

            def get_db_connection():
                return pymysql.connect(**db_config)

            # [added: start] Prometheus metrics exposition endpoint
            # Prometheus scrapes this endpoint to collect metric samples
            @app.route('/metrics')
            def metrics():
                return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
            # [added: end]

            @app.route('/health')
            @order_duration_seconds.time()  # [added] decorator: records the function's execution time
            def health():
                order_requests_total.labels(method='GET', endpoint='/health', status='200').inc()  # [added] increment counter
                return jsonify({'status': 'healthy', 'redis': 'connected' if redis_client.ping() else 'disconnected'})

            @app.route('/orders', methods=['GET'])
            @order_duration_seconds.time()  # [added] record request duration
            def get_orders():
                try:
                    # try the cache first
                    cache_key = 'orders:all'
                    cached_orders = redis_client.get(cache_key)
                    
                    if cached_orders:
                        cache_operations_total.labels(operation='get', status='hit').inc()  # [added] cache hit counter
                        app.logger.info('Orders retrieved from cache')
                        order_requests_total.labels(method='GET', endpoint='/orders', status='200').inc()  # [added]
                        return jsonify(json.loads(cached_orders))
                    
                    # cache miss: read from the database
                    with db_query_duration_seconds.time():  # [added] record database query duration
                        conn = get_db_connection()
                        cursor = conn.cursor(pymysql.cursors.DictCursor)
                        cursor.execute('SELECT * FROM orders ORDER BY order_time DESC LIMIT 20')
                        orders = cursor.fetchall()
                        conn.close()
                    
                    # cache the result with a 60-second TTL
                    redis_client.setex(cache_key, 60, json.dumps(orders))
                    cache_operations_total.labels(operation='get', status='miss').inc()  # [added] cache miss counter
                    app.logger.info('Orders retrieved from database and cached')
                    
                    order_requests_total.labels(method='GET', endpoint='/orders', status='200').inc()  # [added]
                    return jsonify(orders)
                except Exception as e:
                    app.logger.error(f'Error getting orders: {str(e)}')
                    order_requests_total.labels(method='GET', endpoint='/orders', status='500').inc()  # [added] error counter
                    return jsonify({'error': str(e)}), 500

            @app.route('/orders', methods=['POST'])
            @order_duration_seconds.time()  # [added] record request duration
            def create_order():
                try:
                    data = request.json
                    with db_query_duration_seconds.time():  # [added] record database query duration
                        conn = get_db_connection()
                        cursor = conn.cursor()
                        cursor.execute(
                            'INSERT INTO orders (customer_name, coffee_type, quantity, total_price) VALUES (%s, %s, %s, %s)',
                            (data['customer_name'], data['coffee_type'], data['quantity'], data['total_price'])
                        )
                        conn.commit()
                        order_id = cursor.lastrowid
                        conn.close()
                    
                    # invalidate the cache
                    redis_client.delete('orders:all')
                    cache_operations_total.labels(operation='delete', status='success').inc()  # [added] cache delete counter
                    app.logger.info(f'Order {order_id} created, cache cleared')
                    
                    order_requests_total.labels(method='POST', endpoint='/orders', status='201').inc()  # [added]
                    return jsonify({'order_id': order_id, 'message': 'Order created successfully'}), 201
                except Exception as e:
                    app.logger.error(f'Error creating order: {str(e)}')
                    order_requests_total.labels(method='POST', endpoint='/orders', status='500').inc()  # [added] error counter
                    return jsonify({'error': str(e)}), 500

            @app.route('/cache/stats', methods=['GET'])
            @order_duration_seconds.time()  # [added] record request duration
            def cache_stats():
                try:
                    info = redis_client.info('stats')
                    order_requests_total.labels(method='GET', endpoint='/cache/stats', status='200').inc()  # [added]
                    return jsonify({
                        'total_commands_processed': info.get('total_commands_processed', 0),
                        'total_connections_received': info.get('total_connections_received', 0),
                        'keyspace_hits': info.get('keyspace_hits', 0),
                        'keyspace_misses': info.get('keyspace_misses', 0)
                    })
                except Exception as e:
                    order_requests_total.labels(method='GET', endpoint='/cache/stats', status='500').inc()  # [added] error counter
                    return jsonify({'error': str(e)}), 500

            if __name__ == '__main__':
                app.run(host='0.0.0.0', port=5000)
            PYEOF
            python /app/app.py
        ports:
        - containerPort: 5000
        env:
        - name: DB_HOST
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: DB_HOST
        - name: DB_PORT
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: DB_PORT
        - name: DB_USER
          valueFrom:
            configMapKeyRef:
              name: mysql-config
              key: MYSQL_USER
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-secret
              key: MYSQL_PASSWORD
        - name: DB_NAME
          valueFrom:
            configMapKeyRef:
              name: mysql-config
              key: MYSQL_DATABASE
        - name: REDIS_HOST
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: REDIS_HOST
        - name: REDIS_PORT
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: REDIS_PORT
        - name: FLASK_ENV
          valueFrom:
            configMapKeyRef:
              name: order-backend-config
              key: FLASK_ENV
        - name: FLASK_DEBUG
          valueFrom:
            configMapKeyRef:
              name: order-backend-config
              key: FLASK_DEBUG
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        livenessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 10
          periodSeconds: 5
        volumeMounts:
        - name: app-logs
          mountPath: /app/logs
      volumes:
      - name: app-logs
        persistentVolumeClaim:
          claimName: app-log-pvc

After saving the file, apply it:

bash
# Apply the backend Deployment
kubectl apply -f order-backend-deployment.yaml

# Wait for the rollout to finish
kubectl rollout status deployment/order-backend -n cloud-cafe

# View the Pods
kubectl get pods -n cloud-cafe

2.2 Test the Prometheus metrics

bash
# Get the backend Pod name
BACKEND_POD=$(kubectl get pod -l app=order-backend -n cloud-cafe -o jsonpath='{.items[0].metadata.name}')

# Test the metrics endpoint
kubectl exec -it $BACKEND_POD -n cloud-cafe -- curl http://localhost:5000/metrics

# Inspect the metrics in the Prometheus UI
# Open: http://$NODE_IP:$PROMETHEUS_PORT
# Enter in the query box: order_requests_total
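The curl above prints the Prometheus text exposition format: `# HELP`/`# TYPE` comment lines followed by `name{labels} value` samples. A rough parser sketch (simplified; the real format has more cases such as timestamps and escaping) shows the shape of that output:

```python
# Minimal sketch of parsing the text format that /metrics returns.
def parse_metrics(text):
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE comment lines
            continue
        name_part, _, value = line.rpartition(" ")  # value is the last field
        samples[name_part] = float(value)
    return samples

example = """\
# HELP order_requests_total Total number of order requests
# TYPE order_requests_total counter
order_requests_total{method="GET",endpoint="/orders",status="200"} 42.0
order_requests_total{method="POST",endpoint="/orders",status="201"} 7.0
"""
parsed = parse_metrics(example)
print(parsed['order_requests_total{method="GET",endpoint="/orders",status="200"}'])  # 42.0
```

Each labeled combination is its own sample line, which is exactly what you will see as separate series in the Prometheus query UI.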

Step 3: Deploy Grafana

Concept: Grafana is an open-source visualization tool that builds dashboards on top of the data Prometheus collects.

3.1 Deploy Grafana

bash
# Create the Grafana Deployment YAML file
cat > grafana-deployment.yaml << 'EOF'
# Grafana visualization platform Deployment
# Purpose: deploy the Grafana dashboard service
# Features:
#   - connect to Prometheus as a data source
#   - create and display monitoring dashboards
#   - support alert notifications
# Environment variables:
#   - GF_SECURITY_ADMIN_USER: admin username
#   - GF_SECURITY_ADMIN_PASSWORD: admin password (use a Secret in production)
#   - GF_INSTALL_PLUGINS: plugins to preinstall
# Prerequisite: deploy Prometheus first
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
spec:
  replicas: 1                   # single-node Grafana (HA requires shared storage)
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:latest    # official Grafana image
        ports:
        - containerPort: 3000
          name: web                 # web UI port
        # environment variables
        env:
        - name: GF_SECURITY_ADMIN_USER
          value: "admin"            # admin username
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: "admin"            # admin password (default; change it)
        - name: GF_INSTALL_PLUGINS
          value: ""                 # plugins to preinstall (comma-separated)
        # resource limits
        resources:
          requests:
            memory: "256Mi"         # minimum memory
            cpu: "100m"             # minimum CPU (0.1 cores)
          limits:
            memory: "512Mi"         # memory cap
            cpu: "200m"             # CPU cap (0.2 cores)
        # health checks
        livenessProbe:
          httpGet:
            path: /api/health       # Grafana health API
            port: 3000
          initialDelaySeconds: 30    # initial delay (Grafana takes a while to start)
          periodSeconds: 10          # check interval
        readinessProbe:
          httpGet:
            path: /api/health       # Grafana readiness API
            port: 3000
          initialDelaySeconds: 10    # initial delay
          periodSeconds: 5           # check interval
        # data persistence
        volumeMounts:
        - name: grafana-data
          mountPath: /var/lib/grafana    # Grafana data directory (dashboards, settings)
      volumes:
      - name: grafana-data
        emptyDir: {}                 # emptyDir (use a PVC in production)
EOF

# Apply the Deployment
kubectl apply -f grafana-deployment.yaml

# Wait for Grafana to become ready
kubectl rollout status deployment/grafana -n monitoring

# View the Pods
kubectl get pods -n monitoring

3.2 Create the Grafana Service

bash
# Create the Grafana Service
kubectl expose deployment grafana \
  --port=3000 \
  --target-port=3000 \
  --name=grafana-svc \
  -n monitoring

# Create a NodePort Service
kubectl expose deployment grafana \
  --port=3000 \
  --target-port=3000 \
  --name=grafana-nodeport \
  --type=NodePort \
  -n monitoring

# Get the NodePort
GRAFANA_PORT=$(kubectl get svc grafana-nodeport -n monitoring -o jsonpath='{.spec.ports[0].nodePort}')

echo "Grafana URL: http://$NODE_IP:$GRAFANA_PORT"
echo "Username: admin"
echo "Password: admin"

Open http://192.168.56.10:xxxxx in your browser.

Log in with username admin and password admin.

3.3 Configure the Grafana data source

  1. Log in to Grafana
  2. In the left menu, click "Configuration" → "Data Sources"
  3. Click "Add data source"
  4. Select "Prometheus"
  5. Configure the data source:
    • Name: Prometheus
    • URL: http://prometheus-svc.monitoring.svc.cluster.local:9090
  6. Click "Save & Test"

3.4 Create a dashboard

  1. In the left menu, click "+" → "Dashboard"
  2. Click "Add new panel"
  3. Configure the panel:
    • Title: Order Requests Total
    • Query: sum(rate(order_requests_total[5m])) by (endpoint, status)
    • Visualization: Time series
  4. Click "Apply"
  5. Add more panels:
    • Order Request Duration: histogram_quantile(0.95, rate(order_request_duration_seconds_bucket[5m]))
    • Database Query Duration: histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m]))
    • Cache Operations: sum(rate(cache_operations_total[5m])) by (operation, status)
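The `histogram_quantile(0.95, ...)` panels above estimate a latency percentile from cumulative bucket counts. A Python sketch of the estimation (same linear-interpolation idea as PromQL, simplified and without the +Inf edge cases) makes the math concrete:

```python
# buckets: sorted (upper_bound, cumulative_count) pairs, last bound = +Inf,
# as exposed by a Prometheus Histogram's *_bucket series.
def histogram_quantile(q, buckets):
    total = buckets[-1][1]        # the +Inf bucket counts every observation
    rank = q * total              # the observation we are looking for
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the open-ended bucket
            # linearly interpolate inside the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# 100 observations: 50 under 0.1s, 90 under 0.5s, 99 under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # ≈ 0.778 (95th percentile)
```

Because the answer is interpolated within a bucket, the accuracy of a histogram quantile depends entirely on how well the bucket boundaries match your real latency range.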

Step 4: Configure Alerting Rules

4.1 Create the alerting rules ConfigMap

bash
# Create the alerting rules
kubectl create configmap prometheus-rules \
  --from-literal=alerts.yml='
groups:
  - name: cloud-cafe-alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(order_requests_total{status="500"}[5m])) /
          sum(rate(order_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for the last 5 minutes"

      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95, rate(order_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time is above 1 second"

      - alert: HighCacheMissRate
        expr: |
          sum(rate(cache_operations_total{operation="get",status="miss"}[5m])) /
          sum(rate(cache_operations_total{operation="get"}[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High cache miss rate detected"
          description: "Cache miss rate is above 50% for the last 5 minutes"

      - alert: PodNotReady
        expr: |
          kube_pod_status_ready{namespace="cloud-cafe",condition="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod not ready"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is not ready"
' \
  -n monitoring

# View the ConfigMap
kubectl get configmap prometheus-rules -n monitoring

Note: the PodNotReady rule relies on kube_pod_status_ready, a metric exported by kube-state-metrics. This lesson does not deploy kube-state-metrics, so that rule will report no data until you install it.
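The HighErrorRate expression can be unpacked with a small sketch (not Prometheus internals): `rate()` computes the per-second increase of a counter over the window, and the alert compares the ratio of error rate to total rate against the 5% threshold.

```python
# samples: [(timestamp, counter_value), ...] within the window, oldest first.
# Counters only increase, so the per-second rate is the delta over elapsed time
# (counter resets are ignored in this sketch).
def rate(samples):
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

errors_500   = [(0, 10.0), (300, 40.0)]    # 30 errors over the 5m window
all_requests = [(0, 100.0), (300, 500.0)]  # 400 requests over the 5m window

error_rate = rate(errors_500) / rate(all_requests)
print(error_rate)          # 0.075, i.e. 7.5% of requests failed
print(error_rate > 0.05)   # True -> the alert enters "pending", firing after for: 5m
```

The `for: 5m` clause means the condition must hold continuously for 5 minutes before the alert actually fires, which filters out short spikes.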

4.2 Update the Prometheus configuration

bash
# Update the Prometheus configuration to load the alerting rules
kubectl create configmap prometheus-config \
  --from-literal=prometheus.yml='
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: []

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.*):10250"
        replacement: "${1}:9100"
        target_label: __address__
' \
  -n monitoring \
  --dry-run=client -o yaml | kubectl apply -f -

# Create the updated Prometheus Deployment YAML file (mounting the alerting rules)
cat > prometheus-deployment.yaml << 'EOF'
# Prometheus monitoring Deployment (with alerting rules)
# This change: mount the alerting rules
# What changed:
#   1. new prometheus-rules volume and volumeMount
#   2. rule_files is set in prometheus.yml (no extra startup args needed)
# Alerting rules notes:
#   - rule files are mounted at /etc/prometheus/rules/
#   - Prometheus loads and evaluates them automatically
# Prerequisites: the prometheus-config and prometheus-rules ConfigMaps must already exist
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:latest
        args:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--web.console.libraries=/usr/share/prometheus/console_libraries'
          - '--web.console.templates=/usr/share/prometheus/consoles'
          - '--web.enable-lifecycle'
        ports:
        - containerPort: 9090
          name: web
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus
        # [added: start] mount the alerting rules
        - name: prometheus-rules
          mountPath: /etc/prometheus/rules    # alerting rules directory
        # [added: end]
        - name: prometheus-data
          mountPath: /prometheus
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 10
          periodSeconds: 5
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      # [added: start] alerting rules volume
      - name: prometheus-rules
        configMap:
          name: prometheus-rules    # references the alerting rules ConfigMap
      # [added: end]
      - name: prometheus-data
        emptyDir: {}
EOF

# Apply the updated Deployment
kubectl apply -f prometheus-deployment.yaml

# Wait for the rollout to finish
kubectl rollout status deployment/prometheus -n monitoring

# View the Pods
kubectl get pods -n monitoring

4.3 View the alerting rules

In the Prometheus UI:

  1. Open http://$NODE_IP:$PROMETHEUS_PORT
  2. Click "Status" → "Rules"
  3. Review all alerting rules

Step 5: Log Collection

5.1 View application logs

bash
# View backend service logs
kubectl logs -l app=order-backend -n cloud-cafe --tail=50

# Follow logs in real time
kubectl logs -l app=order-backend -n cloud-cafe -f

# View logs for a specific Pod
kubectl logs <pod-name> -n cloud-cafe

# View the previous container's logs (after a restart)
kubectl logs <pod-name> -n cloud-cafe --previous

📌 About the kubectl logs flags

kubectl logs shows the logs of a Pod's containers and supports several ways to filter and follow them.

Common flags

  Flag                 Meaning                                Example
  --tail=N             show only the last N lines             --tail=50 for the last 50 lines
  -f / --follow        stream log output                      like tail -f
  --previous / -p      show the previous container's logs     debugging after a restart
  --since=DURATION     show logs from the last period         --since=10m for the last 10 minutes
  --since-time=TIME    show logs after a given time           ISO 8601 format
  -l / --selector      select Pods by label                   -l app=myapp

Usage scenarios

bash
# Debug CrashLoopBackOff (inspect the last failed run)
kubectl logs my-pod --previous

# Follow logs and filter for errors
kubectl logs -l app=myapp -f | grep ERROR

# Scan the last 24 hours for errors
kubectl logs -l app=myapp --since=24h | grep -i error

Tip: if a Pod has multiple containers, add -c <container-name> to pick one. Note that kubectl logs has no --all-namespaces flag; it works within a single namespace.

5.2 Searching logs with kubectl

bash
# Search logs for a keyword
kubectl logs -l app=order-backend -n cloud-cafe | grep "error"

# Logs from the last 10 minutes
kubectl logs -l app=order-backend -n cloud-cafe --since=10m

# Logs since a specific point in time
kubectl logs -l app=order-backend -n cloud-cafe --since-time="2024-01-01T00:00:00Z"

# Logs from all containers of the matching Pods
kubectl logs -l app=order-backend -n cloud-cafe --all-containers

Verification and Testing

1. Check the status of all resources

bash
# View the Deployments
kubectl get deployment -n monitoring

# View the Pods
kubectl get pods -n monitoring

# View the Services
kubectl get svc -n monitoring

# View the ConfigMaps
kubectl get configmap -n monitoring

2. Test Prometheus

bash
# Open the Prometheus UI
echo "Prometheus URL: http://$NODE_IP:$PROMETHEUS_PORT"

# Test a query
curl -s "http://$NODE_IP:$PROMETHEUS_PORT/api/v1/query?query=order_requests_total" | jq

# View the scrape targets
curl -s "http://$NODE_IP:$PROMETHEUS_PORT/api/v1/targets" | jq
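The query API wraps results in a JSON envelope. A short Python sketch, using a hypothetical captured response rather than a live cluster, shows how to pull the values out:

```python
import json

# Hypothetical /api/v1/query response for order_requests_total (instant vector).
response_text = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "order_requests_total", "endpoint": "/orders", "status": "200"},
       "value": [1700000000, "42"]}
    ]
  }
}
"""
response = json.loads(response_text)
assert response["status"] == "success"
for item in response["data"]["result"]:
    labels = item["metric"]
    ts, value = item["value"]                    # the sample value arrives as a string
    print(labels.get("endpoint"), float(value))  # /orders 42.0
```

Note that sample values are JSON strings, not numbers, so convert them with `float()` before doing arithmetic.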

3. Test Grafana

bash
# Open the Grafana UI
echo "Grafana URL: http://$NODE_IP:$GRAFANA_PORT"
echo "Username: admin"
echo "Password: admin"

In Grafana:

  1. Configure the Prometheus data source
  2. Create a dashboard
  3. Inspect the monitoring data

4. Test the alerting rules

bash
# View the alerting rules in the Prometheus UI
# Open: http://$NODE_IP:$PROMETHEUS_PORT
# Click: Status → Rules

# View the current alerts
curl -s "http://$NODE_IP:$PROMETHEUS_PORT/api/v1/alerts" | jq

5. Test log collection

bash
# Generate some log entries
for i in {1..10}; do
  curl -X POST -H "Host: cloudcafe.local" -H "Content-Type: application/json" \
    -d "{\"customer_name\":\"LogTest$i\",\"coffee_type\":\"Latte\",\"quantity\":1,\"total_price\":25.00}" \
    http://$NODE_IP:$INGRESS_PORT/api/orders
done

# View the logs
kubectl logs -l app=order-backend -n cloud-cafe --tail=20

📝 Summary and Reflection

What this lesson covered

  1. Prometheus: collecting and storing monitoring data
  2. Grafana: visualizing monitoring data
  3. Prometheus metrics: Counter, Histogram, and other metric types
  4. Alerting rules: configuring and managing alerts
  5. Log collection: viewing and analyzing application logs

Key concepts

  • Observability: metrics, logs, and traces working together
  • Metric types: Counter (monotonically increasing), Gauge (goes up and down), Histogram (distribution of observations)
  • Alerting rules: alerts triggered by metric thresholds
  • Log analysis: locating problems through logs
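The three metric types can be contrasted with minimal sketches (illustrative only; use prometheus_client in real code):

```python
class Counter:
    """Monotonically increasing; rate() over it gives throughput."""
    def __init__(self): self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

class Gauge:
    """A value that can go up and down, e.g. current queue length."""
    def __init__(self): self.value = 0.0
    def set(self, v): self.value = v

class Histogram:
    """Counts observations into cumulative buckets for quantile estimation."""
    def __init__(self, bounds=(0.1, 0.5, 1.0, float("inf"))):
        self.bounds = bounds
        self.counts = [0] * len(bounds)
    def observe(self, v):
        for i, b in enumerate(self.bounds):
            if v <= b:
                self.counts[i] += 1  # cumulative: every bucket with bound >= v

requests = Counter(); requests.inc()
queue = Gauge(); queue.set(3); queue.set(1)   # gauges move in both directions
latency = Histogram(); latency.observe(0.3)   # lands in the 0.5, 1.0 and +Inf buckets
print(requests.value, queue.value, latency.counts)  # 1.0 1 [0, 1, 1, 1]
```

Rule of thumb: counts of events are Counters, current states are Gauges, and anything you want percentiles of is a Histogram.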

Questions to think about

  1. What is the difference between Prometheus and Grafana? How do they work together?
  2. How do Counter, Gauge, and Histogram differ? When would you use each?
  3. How would you deliver alert notifications? (Hint: Alertmanager)
  4. How would you implement distributed tracing? (Hint: Jaeger, Zipkin)
  5. How would you aggregate and analyze logs? (Hint: ELK Stack, Loki)

Best practices

  1. Choose metrics deliberately: avoid collecting everything; excess metrics hurt performance
  2. Set sensible alert thresholds: avoid alert storms
  3. Review alerting rules regularly: make sure they still work
  4. Retain logs: define a sensible log retention policy
  5. Monitor the monitoring stack: make sure Prometheus and Grafana themselves are monitored

Next Steps

In this lesson you deployed a monitoring stack and made the system observable.

The next lesson covers Helm and CI/CD for automated deployments.

Next lesson: 07-自动化部署.md


Cleaning Up

If you want to remove the resources created in this lesson:

bash
# Delete the monitoring resources
kubectl delete namespace monitoring

# Delete the other resources
kubectl delete hpa order-backend frontend -n cloud-cafe
kubectl delete deployment redis order-backend frontend -n cloud-cafe
kubectl delete svc redis-svc order-backend-svc frontend-svc -n cloud-cafe
kubectl delete statefulset mysql -n cloud-cafe
kubectl delete svc mysql-service -n cloud-cafe
kubectl delete pvc redis-pvc mysql-pvc app-log-pvc -n cloud-cafe
kubectl delete configmap redis-config frontend-html mysql-config app-config order-backend-config -n cloud-cafe
kubectl delete secret mysql-secret mysql-secret-manual -n cloud-cafe
kubectl delete ingress cloud-cafe-ingress -n cloud-cafe

# Delete the namespace
kubectl delete namespace cloud-cafe

Tip: if you plan to continue with the next lesson, keep these resources — the next lesson builds on them.
