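Monitoring and Alerting Configuration

Course Introduction

Alerting is a core part of a monitoring system: it notifies operations staff promptly when something goes wrong. This lesson covers Prometheus alerting rules, Alertmanager configuration, and notification channels in detail, so that you can build a complete monitoring and alerting setup.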

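1. Alerting Overview

1.1 What Is Alerting

Alerting means notifying operations staff promptly, by email, SMS, DingTalk, WeChat Work, or similar channels, when a system runs into a problem.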

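Benefits of alerting:

Benefit	Description
Early detection	Problems are noticed quickly
Fast response	Operators can react as soon as a problem occurs
Reduced losses	Shorter outages mean lower losses from failures
Higher efficiency	Less manual checking improves operations efficiency
Automation	Alert handling can be automated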

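1.2 Alerting Workflow

An alert passes through several stages on its way from a raw metric to a resolved incident.

Alerting workflow:

Data collection → Data storage → Rule evaluation → Alert firing → Grouping → Deduplication → Inhibition → Notification → Handling → Resolution

Stage	Component
Data collection	Exporter
Data storage	Prometheus
Rule evaluation	Prometheus (PromQL)
Alert firing	Prometheus, which forwards firing alerts to Alertmanager
Grouping	Alertmanager grouping policy
Deduplication	Alertmanager deduplication policy
Inhibition	Alertmanager inhibition rules
Notification	Notification channels
Handling	Operations staff
Resolution	Problem fixed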

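2. Prometheus Alerting Configuration

2.1 Alerting Rules Overview

An alerting rule defines the condition under which an alert fires and, through its labels, the alert's severity.

Structure of an alerting rule: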

yaml

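groups:
  - name: <group name>
    rules:
      - alert: <alert name>
        expr: <PromQL expression>
        for: <how long the condition must hold before the alert fires>
        labels:
          <label name>: <label value>
        annotations:
          summary: <short summary>
          description: <detailed description>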

2.2 Alerting Rule Examples

The following rules cover common host and service conditions.

Example 1: CPU usage alert

yaml

groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
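        # rate() rather than irate(): irate() looks only at the last two samples, so a brief dip can reset the "for" timer
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80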
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."

Example 2: Memory usage alert

yaml

groups:
  - name: node_alerts
    rules:
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 80% for more than 5 minutes on {{ $labels.instance }}."

Example 3: Disk usage alert

yaml

groups:
  - name: node_alerts
    rules:
      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High disk usage detected"
          description: "Disk usage is above 80% for more than 5 minutes on {{ $labels.instance }} ({{ $labels.mountpoint }})."

Example 4: Service down alert

yaml

groups:
  - name: service_alerts
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Service down detected"
          description: "Service {{ $labels.job }} is down on {{ $labels.instance }}."
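Rules like these can be exercised offline before they are deployed. The sketch below writes the ServiceDown rule and a matching unit test into a directory, to be run with `promtool test rules`. The helper name `write_servicedown_test`, the file names, and the `node` / `host1:9100` target are made up for illustration, and the test file is a sketch of promtool's unit-test format that may need adjusting for your Prometheus version.

```bash
# write_servicedown_test: write the ServiceDown rule and a promtool unit test
# into the given directory. Function and file names are illustrative.
write_servicedown_test() {
  local dir=$1
  mkdir -p "$dir"

  # The rule under test (same as Example 4).
  cat > "$dir/service.yml" <<'EOF'
groups:
  - name: service_alerts
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Service down detected"
          description: "Service {{ $labels.job }} is down on {{ $labels.instance }}."
EOF

  # The unit test: a hypothetical target reports up=0 for five minutes.
  cat > "$dir/service_test.yml" <<'EOF'
rule_files:
  - service.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="host1:9100"}'
        values: '0 0 0 0 0 0'
    alert_rule_test:
      # Three minutes in, the condition has held longer than "for: 1m".
      - eval_time: 3m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: ops
              job: node
              instance: host1:9100
            exp_annotations:
              summary: "Service down detected"
              description: "Service node is down on host1:9100."
EOF
}
```

After `write_servicedown_test /tmp/rules`, running `promtool test rules /tmp/rules/service_test.yml` should pass if the alert is firing three minutes into the simulated outage.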

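2.3 Configuring Alerting Rules

Follow these steps to load alerting rules into Prometheus.

Step 1: Create the rules directory

bash

# Create the alerting rules directory
sudo mkdir -p /etc/prometheus/alerts

Step 2: Create the rules file

bash

# Create the alerting rules file and paste in the rules from section 2.2
sudo vim /etc/prometheus/alerts/node.yml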
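A rule file with a YAML or PromQL mistake is easy to produce by hand, so it may be worth validating the file right after saving it. The sketch below wraps `promtool check rules`, which ships in the Prometheus release archive; the helper name `check_rules` is hypothetical, and the function only reports a problem instead of guessing when the file or `promtool` is missing.

```bash
# check_rules: validate a Prometheus rule file with promtool, if available.
# The helper name is illustrative.
check_rules() {
  local file=$1
  if [ ! -f "$file" ]; then
    echo "rule file not found: $file" >&2
    return 1
  fi
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found; skipping validation of $file" >&2
    return 2
  fi
  promtool check rules "$file"
}

# Usage:
#   check_rules /etc/prometheus/alerts/node.yml
```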

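Step 3: Configure Prometheus

bash

# Edit the Prometheus configuration file
sudo vim /etc/prometheus/prometheus.yml

Configuration file contents:

yaml

# Alerting rule files (relative paths are resolved against the directory of prometheus.yml)
rule_files:
  - "alerts/*.yml"

Step 4: Restart Prometheus

bash

# Restart Prometheus so that it loads the rule files
sudo systemctl restart prometheus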
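A mistake in prometheus.yml or in a rule file can keep Prometheus from starting, so it can help to validate the configuration before the restart and to confirm afterwards that the rules were loaded. A minimal sketch, assuming `promtool` and `curl` are installed and Prometheus listens on its default `localhost:9090`; the helper names and the `PROM_URL` variable are illustrative.

```bash
# Address of the Prometheus server (assumed default; override as needed).
PROM_URL=${PROM_URL:-http://localhost:9090}

# Validate prometheus.yml, including the rule files it references.
check_prometheus_config() {
  promtool check config "${1:-/etc/prometheus/prometheus.yml}"
}

# Build the HTTP API URL that lists the loaded rule groups.
rules_api_url() {
  printf '%s/api/v1/rules\n' "${PROM_URL%/}"
}

# Fetch the loaded rule groups as JSON.
list_loaded_rules() {
  curl -sS "$(rules_api_url)"
}

# Usage:
#   check_prometheus_config && sudo systemctl restart prometheus
#   list_loaded_rules
```

If `list_loaded_rules` returns an empty `groups` array, the `rule_files` pattern most likely did not match any file.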

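3. Alertmanager Configuration

3.1 Alertmanager Overview

Alertmanager is the alert-handling component of the Prometheus ecosystem. It receives alerts from Prometheus and takes care of grouping, deduplication, inhibition, and notification.

Alertmanager features:

Feature	Description
Grouping	Combines related alerts into a single notification
Deduplication	Drops repeated copies of the same alert
Inhibition	Mutes an alert while a related alert is already firing
Notification	Sends alerts to the configured receivers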

3.2 Installing Alertmanager

Follow these steps to install Alertmanager.

Step 1: Download Alertmanager

bash

# Download Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz

# Extract the archive
tar -xzf alertmanager-0.26.0.linux-amd64.tar.gz

# Move it to the installation directory
sudo mv alertmanager-0.26.0.linux-amd64 /opt/alertmanager

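Step 2: Create the alertmanager user

bash

# Create a system account and group for the service.
# A service account needs no home directory, login shell, or password.
sudo useradd --system --user-group --no-create-home --shell /usr/sbin/nologin alertmanager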

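Step 3: Configure Alertmanager

bash

# Create the Alertmanager configuration directory
sudo mkdir -p /etc/alertmanager

# Copy the sample configuration file shipped in the archive
sudo cp /opt/alertmanager/alertmanager.yml /etc/alertmanager/

# Give the alertmanager user ownership of the configuration directory
sudo chown -R alertmanager:alertmanager /etc/alertmanager

Step 4: Create the Alertmanager service

bash

# Create the Alertmanager unit file
sudo vim /etc/systemd/system/alertmanager.service

Unit file contents: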

ini

[Unit]
Description=Alertmanager
After=network.target

[Service]
User=alertmanager
Group=alertmanager
ExecStart=/opt/alertmanager/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager

[Install]
WantedBy=multi-user.target

Step 5: Start Alertmanager

bash

# Create the Alertmanager data directory
sudo mkdir -p /var/lib/alertmanager
sudo chown -R alertmanager:alertmanager /var/lib/alertmanager

# Reload systemd
sudo systemctl daemon-reload

# Start Alertmanager
sudo systemctl start alertmanager

# Enable Alertmanager at boot
sudo systemctl enable alertmanager

# Check the Alertmanager status
sudo systemctl status alertmanager

Step 6: Open the Alertmanager web UI

Open a browser and go to http://localhost:9093
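Prometheus only forwards firing alerts once it knows where Alertmanager is, which is set in the `alerting` section of prometheus.yml. The sketch below prints such a fragment; the helper name `alerting_snippet` is hypothetical, and the default target assumes Alertmanager runs on the same host on port 9093, as above.

```bash
# alerting_snippet: print the prometheus.yml fragment that points Prometheus
# at an Alertmanager instance. The helper name is illustrative.
alerting_snippet() {
  local target=${1:-localhost:9093}
  cat <<EOF
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["$target"]
EOF
}

# Usage:
#   alerting_snippet                     # Alertmanager on the same host
#   alerting_snippet am.example:9093     # hypothetical remote Alertmanager
```

Review the output and merge it into /etc/prometheus/prometheus.yml by hand, so that an existing `alerting` section is not duplicated. After a Prometheus restart, `curl -sS http://localhost:9090/api/v1/alertmanagers` should list the instance under `activeAlertmanagers`.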

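3.3 Alertmanager Configuration

The Alertmanager configuration file controls how alerts are grouped, deduplicated, inhibited, and delivered.

Configuration file structure: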

yaml

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'
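Before relying on a configuration like this, it can be checked with `amtool check-config` and exercised end to end by posting a hand-made alert to Alertmanager's v2 API. A minimal sketch, assuming `amtool` and `curl` are installed and Alertmanager listens on `localhost:9093`; the helper names, the `AM_URL` variable, and the `TestAlert` name are made up for illustration.

```bash
# Address of the Alertmanager instance (assumed default; override as needed).
AM_URL=${AM_URL:-http://localhost:9093}

# Build a one-element JSON array in the shape accepted by POST /api/v2/alerts.
# Arguments are inserted as-is, so keep them free of quotes and backslashes.
test_alert_payload() {
  local name=$1 severity=$2
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"manual test alert"}}]\n' \
    "$name" "$severity"
}

# Post the test alert to Alertmanager.
send_test_alert() {
  curl -sS -X POST "${AM_URL%/}/api/v2/alerts" \
    -H 'Content-Type: application/json' \
    -d "$(test_alert_payload "$1" "$2")"
}

# Usage:
#   amtool check-config /etc/alertmanager/alertmanager.yml
#   send_test_alert TestAlert warning
```

If the receiver is configured correctly, the test alert should show up in the web UI and, after `group_wait`, in the mailbox.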

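3.4 Grouping and Routing Configuration

The route section decides which alerts are grouped into one notification, how long Alertmanager waits before sending, and which receiver gets each alert.

Example grouping configuration: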

yaml

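route:
  # Labels used to group alerts into one notification
  group_by: ['alertname', 'cluster', 'service']

  # How long to wait before sending the first notification for a new group
  group_wait: 10s

  # How long to wait before notifying about new alerts added to an existing group
  group_interval: 10s

  # How long to wait before re-sending a notification for alerts that are still firing
  repeat_interval: 12h

  # Default receiver
  receiver: 'default'

  # Child routes; every receiver named here must also be defined under "receivers"
  routes:
    - matchers:
        - severity="critical"
      receiver: 'critical'

    - matchers:
        - severity="warning"
      receiver: 'warning'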
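Routing trees are easy to get subtly wrong, and `amtool config routes test` can show which receiver a given label set would reach without sending anything. The sketch below is a thin wrapper around that command; the helper name `route_for` and the exit code 64 are illustrative choices. Note that the configuration will only load if the `critical` and `warning` receivers referenced above are defined under `receivers`.

```bash
# route_for: print the receiver(s) that a set of labels would be routed to.
# Wraps `amtool config routes test`; the helper name is illustrative.
route_for() {
  local config=$1
  shift
  if [ "$#" -eq 0 ]; then
    echo "usage: route_for <config file> label=value [label=value ...]" >&2
    return 64
  fi
  amtool config routes test --config.file="$config" "$@"
}

# Usage:
#   route_for /etc/alertmanager/alertmanager.yml severity=critical
#   route_for /etc/alertmanager/alertmanager.yml severity=info
```

An alert whose severity matches none of the child routes, such as `severity=info` here, should fall through to the default receiver.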

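3.5 Inhibition Configuration

Inhibition mutes notifications for some alerts while a related alert is already firing. In the example below, a warning alert is muted whenever a critical alert with the same alertname and instance labels is firing.

Example inhibition configuration: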

yaml

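inhibit_rules:
  # While a critical alert (source) is firing, mute warning alerts (target)
  # that carry the same alertname and instance labels
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']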
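To see the rule in action, two alerts that share `alertname` and `instance` but differ in `severity` can be posted by hand; the warning one should then be reported as inhibited. The sketch below only builds the request body. The helper name and the default label values (`DemoAlert`, `host1:9100`) are made up for the demo, and the usage lines assume Alertmanager on `localhost:9093` with `amtool` on the PATH.

```bash
# inhibit_demo_payload: build two alerts with identical alertname and instance
# labels, one critical and one warning. Values are illustrative only.
inhibit_demo_payload() {
  local name=${1:-DemoAlert} instance=${2:-host1:9100}
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"critical"}},' \
    "$name" "$instance"
  printf '{"labels":{"alertname":"%s","instance":"%s","severity":"warning"}}]\n' \
    "$name" "$instance"
}

# Usage:
#   curl -sS -X POST http://localhost:9093/api/v2/alerts \
#     -H 'Content-Type: application/json' -d "$(inhibit_demo_payload)"
#   amtool alert query --inhibited --alertmanager.url=http://localhost:9093
```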

4. Notification Configuration

4.1 Email Notifications

Email is the most common notification channel.

Email notification configuration:

yaml

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'
        require_tls: true

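4.2 DingTalk Notifications

DingTalk group robots are a common notification channel. Note that webhook_configs posts Alertmanager's own JSON payload, whereas the DingTalk robot API expects a message with a msgtype field, so the robot URL cannot be used as the webhook URL directly. A relay that converts between the two formats (for example the open-source prometheus-webhook-dingtalk project) has to sit in between.

DingTalk notification configuration:

yaml

receivers:
  - name: 'default'
    webhook_configs:
      # Address of the relay service (hypothetical example value), not the DingTalk robot URL itself
      - url: 'http://localhost:8060/dingtalk/webhook1/send'
        send_resolved: true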

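DingTalk notification script (calls the robot API directly; useful for testing the access token):

bash

#!/bin/bash
# DingTalk notification script: sends the first argument as a text message
set -euo pipefail

# Configuration (replace xxx with the robot's access token)
WEBHOOK_URL="https://oapi.dingtalk.com/robot/send?access_token=xxx"
MESSAGE="${1:?usage: $0 <message>}"

# Escape backslashes and double quotes so the message cannot break the JSON body
MESSAGE=${MESSAGE//\\/\\\\}
MESSAGE=${MESSAGE//\"/\\\"}

# Send the notification
curl -sS -X POST "$WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"$MESSAGE\"}}"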

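4.3 WeChat Work Notifications

WeChat Work group robots are another common notification channel. As with DingTalk, the robot API expects a message with a msgtype field rather than Alertmanager's webhook payload, so a relay service has to translate between the two.

WeChat Work notification configuration:

yaml

receivers:
  - name: 'default'
    webhook_configs:
      # Address of the relay service (hypothetical example value), not the WeChat Work robot URL itself
      - url: 'http://localhost:8080/wechat-work/send'
        send_resolved: true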

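WeChat Work notification script (calls the robot API directly; useful for testing the robot key):

bash

#!/bin/bash
# WeChat Work notification script: sends the first argument as a text message
set -euo pipefail

# Configuration (replace xxx with the robot's key)
WEBHOOK_URL="https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxx"
MESSAGE="${1:?usage: $0 <message>}"

# Escape backslashes and double quotes so the message cannot break the JSON body
MESSAGE=${MESSAGE//\\/\\\\}
MESSAGE=${MESSAGE//\"/\\\"}

# Send the notification
curl -sS -X POST "$WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"$MESSAGE\"}}"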
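Alert text often contains quotes or line breaks, which have to be escaped before the text is placed inside a JSON string. The sketch below shows one way to do that in plain bash and to build the text-message body used by both scripts above. The helper names `json_escape` and `text_payload` are hypothetical, and control characters other than newline, carriage return, and tab are not handled.

```bash
# json_escape: escape a string for use inside a JSON string literal.
# Handles backslash, double quote, newline, carriage return, and tab.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}      # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\r'/\\r}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"
}

# text_payload: build a {"msgtype":"text",...} request body for a message.
text_payload() {
  printf '{"msgtype":"text","text":{"content":"%s"}}\n' "$(json_escape "$1")"
}

# Usage:
#   curl -sS -X POST "$WEBHOOK_URL" -H 'Content-Type: application/json' \
#     -d "$(text_payload "$MESSAGE")"
```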

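4.4 Slack Notifications

Slack is another common notification channel, and Alertmanager supports it natively through slack_configs.

Slack notification configuration: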

yaml

receivers:
  - name: 'default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'
        send_resolved: true

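5. Alerting Rule Best Practices

5.1 Severity Levels

Severity levels distinguish how serious an alert is and how quickly it should be handled.

Severity levels:

Level	Description	Response time
critical	Critical alert	5 minutes
warning	Warning alert	30 minutes
info	Informational alert	1 hour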

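5.2 Notification Frequency

Notification frequency controls how often alerts of each severity are sent.

Notification frequency:

Alert type	Frequency
critical	Sent immediately
warning	Every 5 minutes
info	Every hour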

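5.3 Alert Content

The alert content describes the alert in enough detail for the recipient to act on it.

Example alert content:

Alert name: HighCPUUsage
Severity: warning
Time: 2024-01-01 10:00:00
Host: 192.168.1.10
Description: CPU usage has been above 80% for 5 minutes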

6. Hands-On Cases

Case 1: Host monitoring alerts

Scenario: configure alerts for host monitoring.

Alerting rules:

yaml

groups:
  - name: node_alerts
    rules:
      # CPU usage alert (rate() is preferred over irate() for alerting)
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."
      
      # Memory usage alert
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 80% for more than 5 minutes on {{ $labels.instance }}."
      
      # Disk usage alert
      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High disk usage detected"
          description: "Disk usage is above 80% for more than 5 minutes on {{ $labels.instance }} ({{ $labels.mountpoint }})."

Case 2: Application monitoring alerts

Scenario: configure alerts for application monitoring.

Alerting rules:

yaml

groups:
  - name: app_alerts
    rules:
      # Service down alert
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Service down detected"
          description: "Service {{ $labels.job }} is down on {{ $labels.instance }}."
      
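      # Request error rate alert (both sides are summed so that their labels match in the division)
      - alert: HighErrorRate
        expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05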
        for: 5m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for more than 5 minutes on {{ $labels.instance }}."
      
      # Response time alert
      - alert: HighResponseTime
        expr: rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]) > 1
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High response time detected"
          description: "Response time is above 1s for more than 5 minutes on {{ $labels.instance }}."
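Before an expression goes into a rule, it can be tried against live data through the Prometheus HTTP API, which makes it easier to see whether it returns any series and what values they have. A minimal sketch, assuming `curl` is installed and Prometheus listens on `localhost:9090`; the helper name `promql` and the `PROM_URL` variable are illustrative.

```bash
# Address of the Prometheus server (assumed default; override as needed).
PROM_URL=${PROM_URL:-http://localhost:9090}

# promql: run an instant query against the Prometheus HTTP API.
# curl -G with --data-urlencode takes care of URL-encoding the expression.
promql() {
  curl -sS -G "${PROM_URL%/}/api/v1/query" --data-urlencode "query=$1"
}

# Usage: run an expression without its threshold to inspect the current values.
#   promql 'up == 0'
#   promql 'rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])'
```

An empty `result` array means the expression currently matches no series, which is worth checking for, since an alert on a misspelled metric name never fires.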

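Course Summary

In this lesson we covered monitoring and alerting configuration.

Core topics:

Alerting overview
Prometheus alerting configuration
Alertmanager configuration
Notification configuration
Alerting rule best practices
Hands-on cases

Key concepts:

Alerting rule: defines the alert condition and severity
Alertmanager: the alert manager, responsible for grouping, deduplication, inhibition, and notification
Grouping: combines related alerts
Deduplication: removes repeated alerts
Inhibition: mutes related alerts
Notification: sends alert notifications

Alerting is a core part of a monitoring system. With this foundation in place, later lessons move on to SSH security configuration, advanced firewall configuration, and more.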

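Exercises

Exercise 1 (Basic)

Configure a Prometheus alerting rule that fires on high CPU usage.

Exercise 2 (Intermediate)

Install Alertmanager and configure email notifications.

Exercise 3 (Extension)

Configure DingTalk notifications so that alerts are delivered to a DingTalk group.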
