# Monitoring and Logging

## Business Scenario

Cloud Cafe's system is up and running, but we need to monitor it in real time so we can detect and resolve problems quickly. To make the system observable, we will deploy a monitoring and logging stack.

Requirements:
- Deploy the Prometheus monitoring system
- Deploy Grafana dashboards
- Collect application logs
- Configure alerting rules

## Learning Objectives

After completing this lesson, you will be able to:
- Deploy and configure Prometheus
- Use Grafana and build dashboards
- Collect and analyze logs
- Configure alerting rules
- Implement system observability
## Prerequisites

### 1. Verify the environment

```bash
# Check the namespace
kubectl get namespace cloud-cafe

# Check existing resources
kubectl get all -n cloud-cafe

# Check metrics-server
kubectl get pods -n kube-system | grep metrics
```

### 2. Create the monitoring namespace

```bash
# Create the monitoring namespace
kubectl create namespace monitoring

# List namespaces
kubectl get namespaces
```

## Hands-On Steps
### Step 1: Deploy Prometheus

**Concept:** Prometheus is an open-source monitoring and alerting toolkit that collects and stores time-series data.

#### 1.1 Create the Prometheus ConfigMap

```bash
# Create the Prometheus configuration
kubectl create configmap prometheus-config \
  --from-literal=prometheus.yml='
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.*):10250"
        replacement: "${1}:9100"
        target_label: __address__
' \
  -n monitoring

# View the ConfigMap
kubectl get configmap prometheus-config -n monitoring
```

Note: the inner YAML uses double quotes so the whole configuration can live inside one single-quoted shell string. The `kubernetes-nodes` job rewrites the kubelet address (port 10250) to port 9100, which assumes a node-exporter is running on each node; without one, those targets will show as down.

#### 1.2 Deploy Prometheus
```bash
# Create the Prometheus Deployment manifest
cat > prometheus-deployment.yaml << 'EOF'
# Prometheus monitoring Deployment
# Purpose: run the Prometheus server
# Features:
# - collects and stores time-series metrics
# - serves the PromQL query API
# - evaluates alerting rules
# Flags:
# - config.file: main configuration file
# - storage.tsdb.path: TSDB storage path
# - web.enable-lifecycle: enables config hot-reload (via /-/reload)
# Prerequisite: the prometheus-config ConfigMap must already exist
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1  # single-node Prometheus (HA requires Thanos or Cortex)
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:latest  # official Prometheus image
        args:
        # Prometheus startup flags
        - '--config.file=/etc/prometheus/prometheus.yml'  # main config file
        - '--storage.tsdb.path=/prometheus'               # TSDB storage path
        - '--web.console.libraries=/usr/share/prometheus/console_libraries'  # console libraries
        - '--web.console.templates=/usr/share/prometheus/consoles'           # console templates
        - '--web.enable-lifecycle'                        # lifecycle API (enables hot reload)
        ports:
        - containerPort: 9090
          name: web  # Web UI and API port
        # volume mounts
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus  # configuration directory
        - name: prometheus-data
          mountPath: /prometheus      # data directory
        # resource limits (Prometheus needs a fair amount of memory)
        resources:
          requests:
            memory: "512Mi"  # minimum memory
            cpu: "250m"      # minimum CPU (0.25 core)
          limits:
            memory: "1Gi"    # memory cap
            cpu: "500m"      # CPU cap (0.5 core)
        # health checks
        livenessProbe:
          httpGet:
            path: /-/healthy  # Prometheus health endpoint
            port: 9090
          initialDelaySeconds: 30  # startup delay (WAL replay takes time)
          periodSeconds: 10        # check interval
        readinessProbe:
          httpGet:
            path: /-/ready    # Prometheus readiness endpoint
            port: 9090
          initialDelaySeconds: 10  # startup delay
          periodSeconds: 5         # check interval
      # volumes
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config  # the Prometheus config ConfigMap
      - name: prometheus-data
        emptyDir: {}  # emptyDir for the lab (use a PVC in production)
EOF

# Apply the Deployment
kubectl apply -f prometheus-deployment.yaml

# Wait for Prometheus to become ready
kubectl rollout status deployment/prometheus -n monitoring

# View the Pods
kubectl get pods -n monitoring
```

#### 1.3 Create the Prometheus Service
```bash
# Create the Prometheus Service
kubectl expose deployment prometheus \
  --port=9090 \
  --target-port=9090 \
  --name=prometheus-svc \
  -n monitoring

# View the Service
kubectl get svc prometheus-svc -n monitoring
```

#### 1.4 Access Prometheus
```bash
# Create a NodePort Service
kubectl expose deployment prometheus \
  --port=9090 \
  --target-port=9090 \
  --name=prometheus-nodeport \
  --type=NodePort \
  -n monitoring

# Get the NodePort
PROMETHEUS_PORT=$(kubectl get svc prometheus-nodeport -n monitoring -o jsonpath='{.spec.ports[0].nodePort}')
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
echo "Prometheus URL: http://$NODE_IP:$PROMETHEUS_PORT"
```

Open http://192.168.56.10:xxxxx in your browser. You should see the Prometheus web UI.
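Besides the web UI, Prometheus answers the same queries over its HTTP API at `/api/v1/query`. A minimal stdlib-only Python sketch (the base URL and the sample response shape are illustrative; plug in the NodePort address you obtained above):

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

def instant_query_url(base_url, promql):
    """Build a URL for Prometheus's instant-query API (/api/v1/query)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

def parse_instant_result(payload):
    """Flatten an instant-query response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return [(s["metric"], float(s["value"][1]))
            for s in payload["data"]["result"]]

# A response shaped like the Prometheus HTTP API returns:
sample = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"endpoint": "/orders", "status": "200"},
         "value": [1700000000, "42"]},
    ]},
}
print(parse_instant_result(sample))  # → [({'endpoint': '/orders', 'status': '200'}, 42.0)]

# Against a live server (NODE_IP/PROMETHEUS_PORT from the shell step above):
# url = instant_query_url("http://192.168.56.10:30900", "up")
# print(parse_instant_result(json.load(urlopen(url))))
```

The sample values (`42`, the timestamp) are made up; only the response structure matches what Prometheus actually returns.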
### Step 2: Expose Prometheus Metrics from the Application

We need to instrument the application with Prometheus metrics so that Prometheus has data to collect.
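What Prometheus scrapes from a `/metrics` endpoint is a plain-text exposition format: one `name{labels} value` line per sample, plus `# HELP` / `# TYPE` comments. A stdlib-only sketch of what that text looks like and a simplified reader for it (handles only the flat common case, for eyeballing an endpoint; the sample numbers are invented):

```python
def parse_exposition(text):
    """Parse simple Prometheus text-format lines into {metric_line: value}.
    Skips # HELP / # TYPE comments; labels must not contain spaces."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")  # value is the last token
        samples[name] = float(value)
    return samples

sample = """\
# HELP order_requests_total Total number of order requests
# TYPE order_requests_total counter
order_requests_total{method="GET",endpoint="/orders",status="200"} 17.0
order_requests_total{method="POST",endpoint="/orders",status="201"} 4.0
"""
metrics = parse_exposition(sample)
print(metrics['order_requests_total{method="POST",endpoint="/orders",status="201"}'])  # → 4.0
```

This is only a reading aid; Prometheus itself and the official client libraries implement the full format (escaping, timestamps, histogram suffixes).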
#### 2.1 Update the Backend Service with Prometheus Metrics

First, create or edit the order-backend-deployment.yaml file:

```bash
# Create/edit the backend Deployment manifest
vim order-backend-deployment.yaml
```

Contents of order-backend-deployment.yaml (with Prometheus metrics added):
```yaml
# Order backend service Deployment
# This revision: exposes Prometheus metrics
# Changes:
# 1. metadata.annotations gains prometheus.io/scrape annotations for auto-discovery
# 2. pip install adds the prometheus-client dependency
# 3. the Python code gains metric definitions, decorators, and a /metrics endpoint
# Prerequisites: Redis, MySQL, and the related ConfigMaps/Secrets must already exist
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-backend
  namespace: cloud-cafe
  labels:
    app: order-backend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: order-backend
  template:
    metadata:
      labels:
        app: order-backend
      # [new] Prometheus auto-discovery annotations
      # These tell Prometheus how to scrape this application's metrics
      annotations:
        prometheus.io/scrape: "true"    # enable scraping
        prometheus.io/port: "5000"      # metrics port
        prometheus.io/path: "/metrics"  # metrics path
    spec:
      containers:
      - name: order-backend
        image: python:3.9-slim
        command: ["/bin/sh", "-c"]
        args:
        - |
          # [changed] add the prometheus-client dependency
          # prometheus-client is the official Python Prometheus client library
          pip install flask pymysql flask-cors redis prometheus-client
          mkdir -p /app  # ensure /app exists before writing the app file
          cat > /app/app.py << 'PYEOF'
          from flask import Flask, request, jsonify
          from flask_cors import CORS
          import pymysql
          import redis
          import os
          import json
          # [new] import the Prometheus client library
          from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
          # Counter: monotonically increasing, good for request totals
          # Histogram: distribution of observations, good for request latency
          from datetime import timedelta

          app = Flask(__name__)
          CORS(app)

          # [new] Prometheus metric definitions
          # order_requests_total: HTTP request count by method, endpoint, and status
          order_requests_total = Counter('order_requests_total', 'Total number of order requests', ['method', 'endpoint', 'status'])
          # order_duration_seconds: request latency distribution
          order_duration_seconds = Histogram('order_request_duration_seconds', 'Order request duration')
          # db_query_duration_seconds: database query latency
          db_query_duration_seconds = Histogram('db_query_duration_seconds', 'Database query duration')
          # cache_operations_total: cache operation count (hit/miss)
          cache_operations_total = Counter('cache_operations_total', 'Total number of cache operations', ['operation', 'status'])

          # database configuration
          db_config = {
              'host': os.getenv('DB_HOST', 'mysql-service'),
              'port': int(os.getenv('DB_PORT', 3306)),
              'user': os.getenv('DB_USER', 'cafeadmin'),
              'password': os.getenv('DB_PASSWORD', 'userpassword123'),
              'database': os.getenv('DB_NAME', 'cloudcafe')
          }

          # Redis configuration
          redis_client = redis.Redis(
              host=os.getenv('REDIS_HOST', 'redis-svc'),
              port=int(os.getenv('REDIS_PORT', 6379)),
              decode_responses=True
          )

          def get_db_connection():
              return pymysql.connect(**db_config)

          # [new] Prometheus metrics endpoint
          # Prometheus scrapes this endpoint for metric data
          @app.route('/metrics')
          def metrics():
              return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

          @app.route('/health')
          @order_duration_seconds.time()  # [new] decorator: records execution time automatically
          def health():
              order_requests_total.labels(method='GET', endpoint='/health', status='200').inc()  # [new] bump counter
              return jsonify({'status': 'healthy', 'redis': 'connected' if redis_client.ping() else 'disconnected'})

          @app.route('/orders', methods=['GET'])
          @order_duration_seconds.time()  # [new] record request latency
          def get_orders():
              try:
                  # try the cache first
                  cache_key = 'orders:all'
                  cached_orders = redis_client.get(cache_key)
                  if cached_orders:
                      cache_operations_total.labels(operation='get', status='hit').inc()  # [new] cache hit
                      app.logger.info('Orders retrieved from cache')
                      order_requests_total.labels(method='GET', endpoint='/orders', status='200').inc()  # [new]
                      return jsonify(json.loads(cached_orders))
                  # cache miss: fall back to the database
                  with db_query_duration_seconds.time():  # [new] record query latency
                      conn = get_db_connection()
                      cursor = conn.cursor(pymysql.cursors.DictCursor)
                      cursor.execute('SELECT * FROM orders ORDER BY order_time DESC LIMIT 20')
                      orders = cursor.fetchall()
                      conn.close()
                  # cache the result with a 60-second TTL
                  redis_client.setex(cache_key, 60, json.dumps(orders))
                  cache_operations_total.labels(operation='get', status='miss').inc()  # [new] cache miss
                  app.logger.info('Orders retrieved from database and cached')
                  order_requests_total.labels(method='GET', endpoint='/orders', status='200').inc()  # [new]
                  return jsonify(orders)
              except Exception as e:
                  app.logger.error(f'Error getting orders: {str(e)}')
                  order_requests_total.labels(method='GET', endpoint='/orders', status='500').inc()  # [new] error count
                  return jsonify({'error': str(e)}), 500

          @app.route('/orders', methods=['POST'])
          @order_duration_seconds.time()  # [new] record request latency
          def create_order():
              try:
                  data = request.json
                  with db_query_duration_seconds.time():  # [new] record query latency
                      conn = get_db_connection()
                      cursor = conn.cursor()
                      cursor.execute(
                          'INSERT INTO orders (customer_name, coffee_type, quantity, total_price) VALUES (%s, %s, %s, %s)',
                          (data['customer_name'], data['coffee_type'], data['quantity'], data['total_price'])
                      )
                      conn.commit()
                      order_id = cursor.lastrowid
                      conn.close()
                  # invalidate the cache
                  redis_client.delete('orders:all')
                  cache_operations_total.labels(operation='delete', status='success').inc()  # [new] cache delete
                  app.logger.info(f'Order {order_id} created, cache cleared')
                  order_requests_total.labels(method='POST', endpoint='/orders', status='201').inc()  # [new]
                  return jsonify({'order_id': order_id, 'message': 'Order created successfully'}), 201
              except Exception as e:
                  app.logger.error(f'Error creating order: {str(e)}')
                  order_requests_total.labels(method='POST', endpoint='/orders', status='500').inc()  # [new] error count
                  return jsonify({'error': str(e)}), 500

          @app.route('/cache/stats', methods=['GET'])
          @order_duration_seconds.time()  # [new] record request latency
          def cache_stats():
              try:
                  info = redis_client.info('stats')
                  order_requests_total.labels(method='GET', endpoint='/cache/stats', status='200').inc()  # [new]
                  return jsonify({
                      'total_commands_processed': info.get('total_commands_processed', 0),
                      'total_connections_received': info.get('total_connections_received', 0),
                      'keyspace_hits': info.get('keyspace_hits', 0),
                      'keyspace_misses': info.get('keyspace_misses', 0)
                  })
              except Exception as e:
                  order_requests_total.labels(method='GET', endpoint='/cache/stats', status='500').inc()  # [new] error count
                  return jsonify({'error': str(e)}), 500

          if __name__ == '__main__':
              app.run(host='0.0.0.0', port=5000)
          PYEOF
          python /app/app.py
        ports:
        - containerPort: 5000
        env:
        - name: DB_HOST
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: DB_HOST
        - name: DB_PORT
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: DB_PORT
        - name: DB_USER
          valueFrom:
            configMapKeyRef:
              name: mysql-config
              key: MYSQL_USER
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-secret
              key: MYSQL_PASSWORD
        - name: DB_NAME
          valueFrom:
            configMapKeyRef:
              name: mysql-config
              key: MYSQL_DATABASE
        - name: REDIS_HOST
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: REDIS_HOST
        - name: REDIS_PORT
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: REDIS_PORT
        - name: FLASK_ENV
          valueFrom:
            configMapKeyRef:
              name: order-backend-config
              key: FLASK_ENV
        - name: FLASK_DEBUG
          valueFrom:
            configMapKeyRef:
              name: order-backend-config
              key: FLASK_DEBUG
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        livenessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 10
          periodSeconds: 5
        volumeMounts:
        - name: app-logs
          mountPath: /app/logs
      volumes:
      - name: app-logs
        persistentVolumeClaim:
          claimName: app-log-pvc
```

After saving the file, apply the configuration:

```bash
# Apply the backend Deployment
kubectl apply -f order-backend-deployment.yaml

# Wait for the rollout to finish
kubectl rollout status deployment/order-backend -n cloud-cafe

# View the Pods
kubectl get pods -n cloud-cafe
```

#### 2.2 Test the Prometheus Metrics
```bash
# Get the backend Pod name
BACKEND_POD=$(kubectl get pod -l app=order-backend -n cloud-cafe -o jsonpath='{.items[0].metadata.name}')

# Hit the metrics endpoint
kubectl exec -it $BACKEND_POD -n cloud-cafe -- curl http://localhost:5000/metrics

# Inspect the metrics in the Prometheus UI
# Open: http://$NODE_IP:$PROMETHEUS_PORT
# Enter in the query box: order_requests_total
```

### Step 3: Deploy Grafana
**Concept:** Grafana is an open-source visualization tool for building dashboards on top of the data Prometheus collects.

#### 3.1 Deploy Grafana
```bash
# Create the Grafana Deployment manifest
cat > grafana-deployment.yaml << 'EOF'
# Grafana visualization Deployment
# Purpose: run the Grafana dashboard service
# Features:
# - connects to the Prometheus data source
# - builds and serves monitoring dashboards
# - supports alert notifications
# Environment variables:
# - GF_SECURITY_ADMIN_USER: admin username
# - GF_SECURITY_ADMIN_PASSWORD: admin password (use a Secret in production)
# - GF_INSTALL_PLUGINS: plugins to preinstall
# Prerequisite: deploy Prometheus first
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
spec:
  replicas: 1  # single-node Grafana (HA requires shared storage)
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:latest  # official Grafana image
        ports:
        - containerPort: 3000
          name: web  # Web UI port
        # environment variables
        env:
        - name: GF_SECURITY_ADMIN_USER
          value: "admin"  # admin username
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: "admin"  # admin password (default, change it)
        - name: GF_INSTALL_PLUGINS
          value: ""  # plugins to preinstall (comma-separated)
        # resource limits
        resources:
          requests:
            memory: "256Mi"  # minimum memory
            cpu: "100m"      # minimum CPU (0.1 core)
          limits:
            memory: "512Mi"  # memory cap
            cpu: "200m"      # CPU cap (0.2 core)
        # health checks
        livenessProbe:
          httpGet:
            path: /api/health  # Grafana health API
            port: 3000
          initialDelaySeconds: 30  # startup delay (Grafana takes a while to start)
          periodSeconds: 10        # check interval
        readinessProbe:
          httpGet:
            path: /api/health  # Grafana readiness check
            port: 3000
          initialDelaySeconds: 10  # startup delay
          periodSeconds: 5         # check interval
        # data persistence
        volumeMounts:
        - name: grafana-data
          mountPath: /var/lib/grafana  # Grafana data directory (dashboards etc.)
      volumes:
      - name: grafana-data
        emptyDir: {}  # emptyDir for the lab (use a PVC in production)
EOF

# Apply the Deployment
kubectl apply -f grafana-deployment.yaml

# Wait for Grafana to become ready
kubectl rollout status deployment/grafana -n monitoring

# View the Pods
kubectl get pods -n monitoring
```

#### 3.2 Create Grafana Services
```bash
# Create the Grafana Service
kubectl expose deployment grafana \
  --port=3000 \
  --target-port=3000 \
  --name=grafana-svc \
  -n monitoring

# Create a NodePort Service
kubectl expose deployment grafana \
  --port=3000 \
  --target-port=3000 \
  --name=grafana-nodeport \
  --type=NodePort \
  -n monitoring

# Get the NodePort
GRAFANA_PORT=$(kubectl get svc grafana-nodeport -n monitoring -o jsonpath='{.spec.ports[0].nodePort}')
echo "Grafana URL: http://$NODE_IP:$GRAFANA_PORT"
echo "Username: admin"
echo "Password: admin"
```

Open http://192.168.56.10:xxxxx in your browser and log in with username admin and password admin.
#### 3.3 Configure the Grafana Data Source

- Log in to Grafana
- In the left menu, open "Configuration" → "Data Sources"
- Click "Add data source"
- Select "Prometheus"
- Configure the data source:
  - Name: `Prometheus`
  - URL: `http://prometheus-svc.monitoring.svc.cluster.local:9090`
- Click "Save & Test"
#### 3.4 Create a Dashboard

- In the left menu, click "+" → "Dashboard"
- Click "Add new panel"
- Configure the panel:
  - Title: `Order Requests Total`
  - Query: `sum(rate(order_requests_total[5m])) by (endpoint, status)`
  - Visualization: Time series
- Click "Apply"
- Add more panels:
  - Order Request Duration: `histogram_quantile(0.95, rate(order_request_duration_seconds_bucket[5m]))`
  - Database Query Duration: `histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m]))`
  - Cache Operations: `sum(rate(cache_operations_total[5m])) by (operation, status)`
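The `histogram_quantile` queries above estimate a quantile from cumulative bucket counters: find the bucket whose cumulative count crosses the target rank, then interpolate linearly inside it. A pure-Python sketch of that estimation (buckets given as `(upper_bound, cumulative_count)` pairs, mirroring the `_bucket` series and their `le` labels; the numbers are made up):

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.
    buckets: sorted (upper_bound, cumulative_count) pairs; last bound is +Inf."""
    total = buckets[-1][1]           # the +Inf bucket counts every observation
    rank = q * total                 # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):    # quantile falls in the +Inf bucket:
                return prev_bound    # clamp to the highest finite bound
            if count == prev_count:
                return bound
            # linear interpolation inside the crossing bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # the 95th percentile lands in the 0.5-1.0 bucket
```

This also explains why the estimate's accuracy depends on how finely the histogram's buckets were chosen when the metric was defined.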
### Step 4: Configure Alerting Rules

#### 4.1 Create the Alerting Rules ConfigMap

```bash
# Create the alerting rules
kubectl create configmap prometheus-rules \
  --from-literal=alerts.yml='
groups:
  - name: cloud-cafe-alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(order_requests_total{status="500"}[5m])) /
          sum(rate(order_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for the last 5 minutes"
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95, rate(order_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "95th percentile response time is above 1 second"
      - alert: HighCacheMissRate
        expr: |
          sum(rate(cache_operations_total{operation="get",status="miss"}[5m])) /
          sum(rate(cache_operations_total{operation="get"}[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High cache miss rate detected"
          description: "Cache miss rate is above 50% for the last 5 minutes"
      - alert: PodNotReady
        expr: |
          kube_pod_status_ready{namespace="cloud-cafe",condition="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod not ready"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is not ready"
' \
  -n monitoring

# View the ConfigMap
kubectl get configmap prometheus-rules -n monitoring
```

Note: `kube_pod_status_ready` is exported by kube-state-metrics, which this lesson does not deploy; the PodNotReady rule will only have data if kube-state-metrics is running in the cluster.

#### 4.2 Update the Prometheus Configuration
```bash
# Update the Prometheus configuration to load the alerting rules
kubectl create configmap prometheus-config \
  --from-literal=prometheus.yml='
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: []

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.*):10250"
        replacement: "${1}:9100"
        target_label: __address__
' \
  -n monitoring \
  --dry-run=client -o yaml | kubectl apply -f -
```
```bash
# Create the updated Prometheus Deployment manifest (mounting the alert rules)
cat > prometheus-deployment.yaml << 'EOF'
# Prometheus Deployment (with alerting rules)
# This revision: mounts the alerting rules
# Changes:
# 1. new prometheus-rules volume and volumeMount
# 2. rule_files is configured in prometheus.yml (not via args)
# Alerting rules:
# - rule files are mounted at /etc/prometheus/rules/
# - Prometheus loads and evaluates them automatically
# Prerequisite: the prometheus-config and prometheus-rules ConfigMaps must exist
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:latest
        args:
        - '--config.file=/etc/prometheus/prometheus.yml'
        - '--storage.tsdb.path=/prometheus'
        - '--web.console.libraries=/usr/share/prometheus/console_libraries'
        - '--web.console.templates=/usr/share/prometheus/consoles'
        - '--web.enable-lifecycle'
        ports:
        - containerPort: 9090
          name: web
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus
        # [new] mount the alerting rules
        - name: prometheus-rules
          mountPath: /etc/prometheus/rules  # alerting rules directory
        - name: prometheus-data
          mountPath: /prometheus
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 10
          periodSeconds: 5
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      # [new] alerting rules volume
      - name: prometheus-rules
        configMap:
          name: prometheus-rules  # the alerting rules ConfigMap
      - name: prometheus-data
        emptyDir: {}
EOF

# Apply the updated Deployment
kubectl apply -f prometheus-deployment.yaml

# Wait for the rollout to finish
kubectl rollout status deployment/prometheus -n monitoring

# View the Pods
kubectl get pods -n monitoring
```

#### 4.3 View the Alerting Rules
In the Prometheus UI:
- Open http://$NODE_IP:$PROMETHEUS_PORT
- Click "Status" → "Rules"
- Review all alerting rules
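The HighErrorRate rule divides the rate of `status="500"` requests by the rate of all requests over a 5-minute window and fires only when the ratio stays above 5% for the whole `for: 5m` period. The arithmetic can be sketched in Python (counter increases over the window stand in for `rate()`; since both numerator and denominator are per-second rates, the time factor cancels):

```python
def error_ratio(errors_delta, total_delta):
    """Fraction of failed requests over a window.
    rate() in PromQL is increase-per-second; the ratio cancels the time factor."""
    if total_delta == 0:
        return 0.0
    return errors_delta / total_delta

def should_alert(ratio_samples, threshold=0.05):
    """'for: 5m' means every evaluation in the window must breach the threshold."""
    return all(r > threshold for r in ratio_samples)

# 500-errors grew by 12 while total requests grew by 200 over the window:
print(error_ratio(12, 200))               # 0.06, above the 5% threshold
print(should_alert([0.06, 0.07, 0.055]))  # True: breached for the whole window
print(should_alert([0.06, 0.03, 0.08]))   # False: one evaluation recovered
```

The second function illustrates why `for:` suppresses flapping alerts: a single bad sample is not enough to fire.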
### Step 5: Log Collection

#### 5.1 View Application Logs

```bash
# View backend service logs
kubectl logs -l app=order-backend -n cloud-cafe --tail=50

# Follow logs in real time
kubectl logs -l app=order-backend -n cloud-cafe -f

# View logs of a specific Pod
kubectl logs <pod-name> -n cloud-cafe

# View logs of the previous container (after a restart)
kubectl logs <pod-name> -n cloud-cafe --previous
```

📌 About the `kubectl logs` flags

`kubectl logs` prints the logs of a Pod's containers and supports several ways to filter and view them. Common flags:

| Flag | Meaning | Example |
|------|---------|---------|
| `--tail=N` | Show only the last N lines | `--tail=50` for the last 50 lines |
| `-f` / `--follow` | Stream logs in real time | like `tail -f` |
| `--previous` / `-p` | Logs of the previous container | debugging after a restart |
| `--since=DURATION` | Logs from the last period | `--since=10m` for the last 10 minutes |
| `--since-time=TIME` | Logs after a given timestamp | ISO 8601 format |
| `--all-namespaces` / `-A` | All namespaces | combine with `-l` |

Typical uses:

```bash
# Debug a CrashLoopBackOff (logs of the last failed run)
kubectl logs my-pod --previous

# Follow logs and filter for errors
kubectl logs -l app=myapp -f | grep ERROR

# All errors from the last 24 hours
kubectl logs -l app=myapp --since=24h | grep -i error
```

Tip: if the Pod has multiple containers, add `-c <container-name>` to pick one.
#### 5.2 Search Logs with kubectl

```bash
# Search for a keyword
kubectl logs -l app=order-backend -n cloud-cafe | grep "error"

# Logs from the last 10 minutes
kubectl logs -l app=order-backend -n cloud-cafe --since=10m

# Logs after a specific point in time
kubectl logs -l app=order-backend -n cloud-cafe --since-time="2024-01-01T00:00:00Z"

# Logs across all namespaces
kubectl logs --all-namespaces --selector=app=order-backend
```

## Verification and Testing
### 1. Check resource status

```bash
# Deployments
kubectl get deployment -n monitoring

# Pods
kubectl get pods -n monitoring

# Services
kubectl get svc -n monitoring

# ConfigMaps
kubectl get configmap -n monitoring
```

### 2. Test Prometheus
```bash
# Prometheus UI
echo "Prometheus URL: http://$NODE_IP:$PROMETHEUS_PORT"

# Run a query
curl -s "http://$NODE_IP:$PROMETHEUS_PORT/api/v1/query?query=order_requests_total" | jq

# List scrape targets
curl -s "http://$NODE_IP:$PROMETHEUS_PORT/api/v1/targets" | jq
```

### 3. Test Grafana
```bash
# Grafana UI
echo "Grafana URL: http://$NODE_IP:$GRAFANA_PORT"
echo "Username: admin"
echo "Password: admin"
```

In Grafana:
- Configure the Prometheus data source
- Create a dashboard
- Inspect the monitoring data
### 4. Test the alerting rules

```bash
# View the rules in the Prometheus UI
# Open: http://$NODE_IP:$PROMETHEUS_PORT
# Click: Status → Rules

# List current alerts
curl -s "http://$NODE_IP:$PROMETHEUS_PORT/api/v1/alerts" | jq
```

### 5. Test log collection
```bash
# Generate some traffic ($INGRESS_PORT was set in the earlier Ingress lesson)
for i in {1..10}; do
  curl -X POST -H "Host: cloudcafe.local" -H "Content-Type: application/json" \
    -d "{\"customer_name\":\"log-test-$i\",\"coffee_type\":\"latte\",\"quantity\":1,\"total_price\":25.00}" \
    http://$NODE_IP:$INGRESS_PORT/api/orders
done

# View the logs
kubectl logs -l app=order-backend -n cloud-cafe --tail=20
```

## 📝 Summary and Review
### What You Learned

- Prometheus: collecting and storing monitoring data
- Grafana: visualizing monitoring data
- Prometheus metrics: Counter, Histogram, and other metric types
- Alerting rules: configuring and managing alerts
- Log collection: viewing and analyzing application logs

### Key Concepts

- Observability: metrics, logs, and traces working together
- Metric types: Counter, Gauge, Histogram
- Alerting rules: alerts triggered by metric thresholds
- Log analysis: locating problems through logs
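The three metric types differ mainly in which operations they allow. A pure-Python sketch of their semantics (the real prometheus_client API is similar in spirit but not identical; these toy classes exist only to make the distinction concrete):

```python
class Counter:
    """Monotonically increasing; only inc(). Queried with rate()/increase()."""
    def __init__(self): self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class Gauge:
    """Can go up and down; snapshots current state (queue depth, temperature)."""
    def __init__(self): self.value = 0.0
    def set(self, v): self.value = v
    def inc(self, a=1.0): self.value += a
    def dec(self, a=1.0): self.value -= a

class Histogram:
    """Counts observations into cumulative buckets; enables quantile estimates."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0
    def observe(self, v):
        self.total += v
        for i, b in enumerate(self.bounds):
            if v <= b:
                self.counts[i] += 1  # buckets are cumulative: every bound >= v counts
        self.counts[-1] += 1         # +Inf counts every observation

requests = Counter(); requests.inc()
in_flight = Gauge(); in_flight.inc(); in_flight.dec()
latency = Histogram([0.1, 0.5, 1.0])
latency.observe(0.3)
print(requests.value, in_flight.value, latency.counts)  # → 1.0 0.0 [0, 1, 1, 1]
```

Rule of thumb: if resetting to zero on restart is acceptable (Prometheus compensates via `rate()`), use a Counter; if the current value itself is the signal, use a Gauge; if you need percentiles, use a Histogram.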
### Review Questions

- What is the difference between Prometheus and Grafana? How do they work together?
- How do Counter, Gauge, and Histogram differ? When would you use each?
- How would you deliver alert notifications? (Hint: Alertmanager)
- How would you implement distributed tracing? (Hint: Jaeger, Zipkin)
- How would you aggregate and analyze logs? (Hint: ELK Stack, Loki)

### Best Practices

- Choose metrics deliberately: collecting too many hurts performance
- Set sensible alert thresholds: avoid alert storms
- Review alerting rules regularly: make sure they still work
- Retain logs: set a sensible log retention policy
- Monitor the monitoring: make sure Prometheus and Grafana are themselves monitored

### Next Steps

In this lesson you deployed a monitoring stack and made the system observable. The next lesson covers Helm and CI/CD for automated deployment.

Next lesson: 07-自动化部署.md
## Cleaning Up

If you want to remove the resources created in this lesson:

```bash
# Delete the monitoring resources
kubectl delete namespace monitoring

# Delete the other resources
kubectl delete hpa order-backend frontend -n cloud-cafe
kubectl delete deployment redis order-backend frontend -n cloud-cafe
kubectl delete svc redis-svc order-backend-svc frontend-svc -n cloud-cafe
kubectl delete statefulset mysql -n cloud-cafe
kubectl delete svc mysql-service -n cloud-cafe
kubectl delete pvc redis-pvc mysql-pvc app-log-pvc -n cloud-cafe
kubectl delete configmap redis-config frontend-html mysql-config app-config order-backend-config -n cloud-cafe
kubectl delete secret mysql-secret mysql-secret-manual -n cloud-cafe
kubectl delete ingress cloud-cafe-ingress -n cloud-cafe

# Delete the namespace
kubectl delete namespace cloud-cafe
```

Tip: if you plan to continue with the next lesson, keep these resources; the next lesson builds on them.