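Custom Script Monitoring

Course Objectives

Understand why custom script monitoring matters and where it applies
Learn how to develop different types of monitoring scripts
Learn how to integrate custom scripts with common monitoring tools
Understand how to deploy and manage monitoring scripts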

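1. Overview of Custom Script Monitoring

1.1 What Is Custom Script Monitoring

Custom script monitoring means writing scripts (in Shell, Python, Go, and so on) that check the state of a system, application, or service and feed the results into a monitoring system. Because the logic lives in a script, it can be tailored to each specific monitoring requirement.

1.2 Use Cases

System monitoring: CPU, memory, disk, network, and other system metrics
Application monitoring: application health, response time, error rate, and so on
Service monitoring: running state and performance of individual services
Business monitoring: business metrics, user behavior, and so on
Custom metric monitoring: metrics specific to a particular business scenario

1.3 Advantages

Flexibility: the monitoring logic is written for the exact requirement
Extensibility: scripts can cover any system, application, or service
Low cost: scripting languages keep development effort small
Easy integration: script output can be fed into most monitoring systems
Timeliness: scripts can run at short intervals, giving near-real-time results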

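2. Monitoring Script Development Basics

2.1 Choosing a Scripting Language

Language	Strengths	Weaknesses	Typical use
Shell	Simple, built into the system	Limited features; complex logic is hard to express	Simple system monitoring
Python	Powerful, rich library ecosystem	Moderate performance	Complex monitoring logic that processes varied data
Go	Excellent performance, strong concurrency	Steeper learning curve	High-performance monitoring with large data volumes
PowerShell	Native support on Windows	Limited cross-platform support	Windows system monitoring

2.2 Basic Structure of a Monitoring Script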

bash

#!/bin/bash

# Example monitoring script

# Script name
SCRIPT_NAME="system_monitor.sh"

# Log file
LOG_FILE="/var/log/system_monitor.log"

# Alert thresholds (percent)
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90

# Monitoring functions
function monitor_cpu() {
    # CPU usage = 100 - idle. Assumes procps-ng top output in an English locale;
    # the idle field can shift position when a value fills its column.
    cpu_usage=$(top -bn1 | grep "%Cpu(s)" | awk '{print 100 - $8}')
    echo "CPU usage: ${cpu_usage}%"

    if (( $(echo "$cpu_usage > $CPU_THRESHOLD" | bc -l) )); then
        echo "Warning: CPU usage is above the ${CPU_THRESHOLD}% threshold"
        # Send an alert
        send_alert "CPU usage alert" "CPU usage: ${cpu_usage}%"
    fi
}

function monitor_memory() {
    # Memory usage from free(1), in MB
    memory_total=$(free -m | grep Mem: | awk '{print $2}')
    memory_used=$(free -m | grep Mem: | awk '{print $3}')
    # Multiply before dividing so that scale=2 does not truncate the ratio first
    memory_usage=$(echo "scale=2; $memory_used * 100 / $memory_total" | bc)
    echo "Memory usage: ${memory_usage}%"

    if (( $(echo "$memory_usage > $MEMORY_THRESHOLD" | bc -l) )); then
        echo "Warning: memory usage is above the ${MEMORY_THRESHOLD}% threshold"
        # Send an alert
        send_alert "Memory usage alert" "Memory usage: ${memory_usage}%"
    fi
}

function monitor_disk() {
    # Usage of the root filesystem; df -P keeps each filesystem on one line
    disk_usage=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
    echo "Disk usage: ${disk_usage}%"

    if [ "$disk_usage" -gt "$DISK_THRESHOLD" ]; then
        echo "Warning: disk usage is above the ${DISK_THRESHOLD}% threshold"
        # Send an alert
        send_alert "Disk usage alert" "Disk usage: ${disk_usage}%"
    fi
}

function send_alert() {
    # Record the alert in the log file
    local subject=$1
    local message=$2
    echo "$(date '+%Y-%m-%d %H:%M:%S') - ${subject}: ${message}" >> "$LOG_FILE"
    # Email, SMS, or other alert channels can be added here
}

# Main function
function main() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - monitoring started" >> "$LOG_FILE"
    monitor_cpu
    monitor_memory
    monitor_disk
    echo "$(date '+%Y-%m-%d %H:%M:%S') - monitoring finished" >> "$LOG_FILE"
}

# Run the main function
main

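2.3 Output Formats for Monitoring Data

Plain text: simple output that is easy for people to read
JSON: structured output that is easy to parse
Key-value pairs: convenient for monitoring systems to ingest
System-specific formats: for example the Prometheus exposition format or the Zabbix item format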
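
To make the four formats concrete, here is a minimal Python sketch that renders the same sample in each of them. The `render` function and the metric names are illustrative assumptions, not part of any monitoring system's API.

```python
import json


def render(metrics, fmt):
    """Render a flat {name: value} dict in one of four output formats."""
    if fmt == "text":
        return "\n".join(f"{name}: {value}" for name, value in metrics.items())
    if fmt == "json":
        return json.dumps(metrics, sort_keys=True)
    if fmt == "kv":
        return " ".join(f"{name}={value}" for name, value in metrics.items())
    if fmt == "prometheus":
        lines = []
        for name, value in metrics.items():
            lines.append(f"# TYPE {name} gauge")  # one TYPE line per metric
            lines.append(f"{name} {value}")
        return "\n".join(lines) + "\n"
    raise ValueError(f"unknown format: {fmt}")


# Hypothetical sample values used only for the demonstration
sample = {"cpu_usage_percent": 12.5, "disk_usage_percent": 40.0}
for fmt in ("text", "json", "kv", "prometheus"):
    print(render(sample, fmt))
```

Keeping collection and rendering separate like this lets one script serve several monitoring systems by switching only the output format.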

3. Developing System Monitoring Scripts

3.1 CPU Monitoring Script

bash

#!/bin/bash

# CPU monitoring script

# Overall CPU usage
function get_cpu_usage() {
    # Parse the idle percentage from top and subtract it from 100
    cpu_usage=$(top -bn1 | grep "%Cpu(s)" | awk '{print 100 - $8}')
    echo "$cpu_usage"
}

# Per-core CPU usage
function get_cpu_core_usage() {
    # mpstat (sysstat package) prints %idle as the last column; read the Average rows
    LC_ALL=C mpstat -P ALL 1 1 | awk '/^Average:/ && $2 ~ /^[0-9]+$/ {printf "CPU %s: %.2f%%\n", $2, 100 - $NF}'
}

# Main function
function main() {
    echo "=== CPU monitoring ==="
    echo "Overall CPU usage: $(get_cpu_usage)%"
    echo "Per-core usage:"
    get_cpu_core_usage
}

# Run the main function
main
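
Parsing `top` output depends on its column layout. As an alternative sketch, CPU usage can be computed from two samples of the aggregate `cpu` line in `/proc/stat` on Linux. The function names and the sample lines below are assumptions made for illustration.

```python
def parse_cpu_line(line):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = [int(x) for x in line.split()[1:]]
    idle = fields[3] + fields[4]   # idle + iowait
    total = sum(fields[:8])        # user..steal; guest time is already part of user
    return idle, total


def cpu_usage_percent(prev_line, curr_line):
    """CPU usage between two samples, as a percentage."""
    prev_idle, prev_total = parse_cpu_line(prev_line)
    curr_idle, curr_total = parse_cpu_line(curr_line)
    total_delta = curr_total - prev_total
    if total_delta <= 0:
        return 0.0
    return round(100.0 * (1 - (curr_idle - prev_idle) / total_delta), 2)


# Two made-up samples; a real script would read /proc/stat twice, about a second apart
prev = "cpu  100 0 50 800 50 0 0 0 0 0"
curr = "cpu  150 0 70 1700 80 0 0 0 0 0"
print(cpu_usage_percent(prev, curr))
```

Because the counters are cumulative, a single reading says nothing about current load; the usage figure always comes from the difference between two readings.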

3.2 Memory Monitoring Script

bash

#!/bin/bash

# Memory monitoring script

# Memory usage
function get_memory_usage() {
    # Show the raw Mem: line from free
    free -m | grep Mem:

    # Compute memory usage
    memory_total=$(free -m | grep Mem: | awk '{print $2}')
    memory_used=$(free -m | grep Mem: | awk '{print $3}')
    memory_free=$(free -m | grep Mem: | awk '{print $4}')
    # Multiply before dividing so that scale=2 does not truncate the ratio first
    memory_usage=$(echo "scale=2; $memory_used * 100 / $memory_total" | bc)

    echo "Memory usage: ${memory_usage}%"
    echo "Total memory: ${memory_total}MB"
    echo "Used memory: ${memory_used}MB"
    # Column 4 of free is "free", which excludes reclaimable cache
    echo "Free memory: ${memory_free}MB"
}

# Swap usage
function get_swap_usage() {
    # Show the raw Swap: line from free
    free -m | grep Swap:

    # Compute swap usage
    swap_total=$(free -m | grep Swap: | awk '{print $2}')
    swap_used=$(free -m | grep Swap: | awk '{print $3}')
    swap_free=$(free -m | grep Swap: | awk '{print $4}')

    if [ "$swap_total" -gt 0 ]; then
        swap_usage=$(echo "scale=2; $swap_used * 100 / $swap_total" | bc)
        echo "Swap usage: ${swap_usage}%"
    else
        echo "Swap is not enabled"
    fi
}

# Main function
function main() {
    echo "=== Memory monitoring ==="
    get_memory_usage
    echo ""
    echo "=== Swap monitoring ==="
    get_swap_usage
}

# Run the main function
main
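
The "used" column of `free` does not tell how much memory applications can still obtain. A common alternative, sketched below under the assumption of a Linux kernel that exposes `MemAvailable` in `/proc/meminfo` (3.14 or later), is to compute usage as total minus available. The function names and the sample text are illustrative.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            info[key.strip()] = int(rest.split()[0])
    return info


def memory_usage_percent(info):
    """Share of memory that is not available to applications, as a percentage."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return round(100.0 * (total - available) / total, 2)


# Made-up sample; a real script would read open("/proc/meminfo").read()
sample = """MemTotal:       8000000 kB
MemFree:        1000000 kB
MemAvailable:   6000000 kB
"""
print(memory_usage_percent(parse_meminfo(sample)))
```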

3.3 Disk Monitoring Script

bash

#!/bin/bash

# Disk monitoring script

# Disk usage of all filesystems
function get_disk_usage() {
    df -h
}

# Usage of one mount point
function get_mount_usage() {
    local mount_point=$1

    if [ -z "$mount_point" ]; then
        echo "Please specify a mount point"
        return 1
    fi

    # Check that the path is a mount point
    if ! mountpoint -q "$mount_point"; then
        echo "$mount_point is not a mount point"
        return 1
    fi

    # Pass the path to df directly; grepping for "/" would match every line
    df -h "$mount_point"

    # Inode usage of the same mount point
    echo ""
    echo "Inode usage:"
    df -i "$mount_point"
}

# Disk I/O statistics
function get_disk_io() {
    # iostat is part of the sysstat package
    iostat -x
}

# Main function
function main() {
    echo "=== Disk usage ==="
    get_disk_usage
    echo ""
    echo "=== Root filesystem usage ==="
    get_mount_usage "/"
    echo ""
    echo "=== Disk I/O ==="
    get_disk_io
}

# Run the main function
main
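
When only a usage percentage is needed, parsing `df` output can be avoided. The sketch below uses `shutil.disk_usage` from the Python standard library; the function name and the choice of `/` are assumptions for illustration. Note that `df` divides by used plus available space, so its percentage can be slightly higher on filesystems that reserve blocks for root.

```python
import shutil


def disk_usage_percent(path="/"):
    """Used space of the filesystem containing path, as a percentage of its size."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return round(100.0 * usage.used / usage.total, 2)


print(f"Root filesystem usage: {disk_usage_percent('/')}%")
```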

3.4 Network Monitoring Script

bash

#!/bin/bash

# Network monitoring script

# Network interface information
function get_network_interfaces() {
    # Prefer ip (iproute2); ifconfig is not installed by default on many newer distributions
    if command -v ip &> /dev/null; then
        ip addr
    else
        ifconfig
    fi
}

# Listening sockets
function get_network_connections() {
    # Prefer ss; fall back to netstat from net-tools
    if command -v ss &> /dev/null; then
        ss -tuln
    else
        netstat -tuln
    fi
}

# Network traffic
function get_network_traffic() {
    # vnstat must be installed separately
    if command -v vnstat &> /dev/null; then
        vnstat
    else
        echo "vnstat is not installed; install it with apt install vnstat or yum install vnstat"

        # Fall back to the kernel's per-interface counters
        echo ""
        echo "Interface counters from /proc/net/dev:"
        cat /proc/net/dev
    fi
}

# Connectivity test
function test_network_connectivity() {
    local host=$1

    if [ -z "$host" ]; then
        host="google.com"
    fi

    echo "Testing connectivity to $host:"
    ping -c 5 "$host"
}

# Main function
function main() {
    echo "=== Network interfaces ==="
    get_network_interfaces
    echo ""
    echo "=== Listening sockets ==="
    get_network_connections
    echo ""
    echo "=== Network traffic ==="
    get_network_traffic
    echo ""
    echo "=== Connectivity test ==="
    test_network_connectivity
}

# Run the main function
main
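
Interface counters are cumulative, so throughput has to be derived from two readings. The following sketch parses the Linux `/proc/net/dev` layout (eight receive columns followed by eight transmit columns per interface) and computes a byte rate. The function names and the sample text are assumptions for illustration.

```python
def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:          # skip the two header lines
        name, _, data = line.partition(":")
        fields = data.split()
        if len(fields) >= 16:
            counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rx_rate(prev, curr, interface, seconds):
    """Received bytes per second between two parsed samples."""
    return (curr[interface][0] - prev[interface][0]) / seconds


# Made-up sample; a real script would read open("/proc/net/dev").read()
sample = """Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0
  eth0: 5000 50 0 0 0 0 0 0 2500 25 0 0 0 0 0 0
"""
print(parse_net_dev(sample))
```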

4. Developing Application Monitoring Scripts

4.1 Service Status Monitoring Script

bash

#!/bin/bash

# Service status monitoring script

# Show the status of one service
function check_service_status() {
    local service_name=$1

    if [ -z "$service_name" ]; then
        echo "Please specify a service name"
        return 1
    fi

    # Use systemctl when available, otherwise the service command
    if command -v systemctl &> /dev/null; then
        systemctl status "$service_name"
    elif command -v service &> /dev/null; then
        service "$service_name" status
    else
        echo "Cannot check service status: neither systemctl nor service is available"
        return 1
    fi
}

# Show the status of several services
function check_multiple_services() {
    local services=("$@")

    for service in "${services[@]}"; do
        echo "=== Checking service: $service ==="
        check_service_status "$service"
        echo ""
    done
}

# Return 0 if the service is running
function is_service_running() {
    local service_name=$1

    if [ -z "$service_name" ]; then
        echo "Please specify a service name"
        return 1
    fi

    # systemctl is-active exits with 0 only when the unit is active
    if command -v systemctl &> /dev/null; then
        systemctl is-active --quiet "$service_name"
        return $?
    elif command -v service &> /dev/null; then
        service "$service_name" status > /dev/null 2>&1
        return $?
    else
        echo "Cannot check service status: neither systemctl nor service is available"
        return 1
    fi
}

# Main function
function main() {
    # Show the status of common services
    services=("ssh" "nginx" "mysql" "redis")
    check_multiple_services "${services[@]}"

    # Report whether each service is running
    echo "=== Service running state ==="
    for service in "${services[@]}"; do
        if is_service_running "$service"; then
            echo "Service $service is running"
        else
            echo "Service $service is not running"
        fi
    done
}

# Run the main function
main

4.2 Web Application Monitoring Script

python

#!/usr/bin/env python3

# Web application monitoring script

import requests
import time
import json

# Check the status of one web application
def check_web_status(url):
    """Request the URL and report status code, response time, and up/down state."""
    try:
        start_time = time.time()
        response = requests.get(url, timeout=10)
        end_time = time.time()

        response_time = end_time - start_time

        status = {
            "url": url,
            "status_code": response.status_code,
            "response_time": round(response_time, 3),
            "status": "up" if response.status_code < 400 else "down",
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
        }

        return status
    except requests.RequestException as e:
        # Connection errors, timeouts, and invalid responses all count as down
        status = {
            "url": url,
            "status_code": 0,
            "response_time": 0,
            "status": "down",
            "error": str(e),
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
        }
        return status

# Check several web applications
def check_multiple_webs(urls):
    """Check each URL in turn and collect the results."""
    results = []

    for url in urls:
        result = check_web_status(url)
        results.append(result)

    return results

# Send an alert
def send_alert(status):
    """Print an alert when the application is down."""
    if status["status"] == "down":
        print(f"Alert: web application {status['url']} is unavailable")
        print(f"Status code: {status['status_code']}")
        if "error" in status:
            print(f"Error: {status['error']}")

# Main function
def main():
    # Web applications to monitor
    urls = [
        "https://www.google.com",
        "https://www.baidu.com",
        "https://www.github.com"
    ]

    print("=== Web application monitoring ===")
    results = check_multiple_webs(urls)

    for result in results:
        print(json.dumps(result, indent=2, ensure_ascii=False))
        print("-")
        # Send an alert if needed
        send_alert(result)

if __name__ == "__main__":
    main()

4.3 Database Monitoring Script

python

#!/usr/bin/env python3

# Database monitoring script

import pymysql
import psycopg2
import time

# Monitor a MySQL database
def monitor_mysql(host, port, user, password, database):
    """Connect to MySQL and collect basic status values."""
    try:
        start_time = time.time()
        conn = pymysql.connect(
            host=host,
            port=port,
            user=user,
            password=password,
            database=database
        )
        end_time = time.time()

        connection_time = end_time - start_time

        # Collect status values
        with conn.cursor() as cursor:
            # Server version
            cursor.execute("SELECT VERSION()")
            version = cursor.fetchone()[0]

            # Number of open connections
            cursor.execute("SHOW STATUS LIKE 'Threads_connected'")
            connections = cursor.fetchone()[1]

            # Cumulative statement count since startup; QPS is the
            # difference between two samples divided by the interval
            cursor.execute("SHOW STATUS LIKE 'Queries'")
            queries = cursor.fetchone()[1]

        conn.close()

        status = {
            "database": "mysql",
            "host": host,
            "port": port,
            "status": "up",
            "version": version,
            "connection_time": round(connection_time, 3),
            "connections": connections,
            "queries": queries,
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
        }

        return status
    except Exception as e:
        status = {
            "database": "mysql",
            "host": host,
            "port": port,
            "status": "down",
            "error": str(e),
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
        }
        return status

# Monitor a PostgreSQL database
def monitor_postgresql(host, port, user, password, database):
    """Connect to PostgreSQL and collect basic status values."""
    try:
        start_time = time.time()
        conn = psycopg2.connect(
            host=host,
            port=port,
            user=user,
            password=password,
            dbname=database
        )
        end_time = time.time()

        connection_time = end_time - start_time

        # Collect status values
        with conn.cursor() as cursor:
            # Server version
            cursor.execute("SELECT version()")
            version = cursor.fetchone()[0]

            # Number of sessions across the whole server
            cursor.execute("SELECT count(*) FROM pg_stat_activity")
            connections = cursor.fetchone()[0]

            # Number of backends connected to this database (not a query count)
            cursor.execute("SELECT pg_stat_database.numbackends FROM pg_stat_database WHERE pg_stat_database.datname = %s", (database,))
            backends = cursor.fetchone()[0]

        conn.close()

        status = {
            "database": "postgresql",
            "host": host,
            "port": port,
            "status": "up",
            "version": version,
            "connection_time": round(connection_time, 3),
            "connections": connections,
            "backends": backends,
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
        }

        return status
    except Exception as e:
        status = {
            "database": "postgresql",
            "host": host,
            "port": port,
            "status": "down",
            "error": str(e),
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
        }
        return status

# Main function
def main():
    print("=== Database monitoring ===")

    # Monitor MySQL (the credentials are placeholders; do not hard-code real ones)
    mysql_status = monitor_mysql(
        host="localhost",
        port=3306,
        user="root",
        password="password",
        database="mysql"
    )
    print("MySQL result:")
    for key, value in mysql_status.items():
        print(f"{key}: {value}")
    print("-")

    # Monitor PostgreSQL (placeholder credentials again)
    pg_status = monitor_postgresql(
        host="localhost",
        port=5432,
        user="postgres",
        password="password",
        database="postgres"
    )
    print("PostgreSQL result:")
    for key, value in pg_status.items():
        print(f"{key}: {value}")

if __name__ == "__main__":
    main()
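
The MySQL `Queries` status value is a counter that grows from server startup, so a rate such as QPS has to be derived from two samples. A minimal sketch, with a hypothetical `qps` helper and made-up sample values:

```python
def qps(prev_queries, curr_queries, interval_seconds):
    """Queries per second between two readings of the cumulative Queries counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = int(curr_queries) - int(prev_queries)   # values may arrive as strings
    if delta < 0:
        # The counter went backwards, for example after a server restart
        return None
    return delta / interval_seconds


# Made-up readings taken 60 seconds apart
print(qps("1000", "1600", 60))
```

Returning `None` on a counter reset keeps a restart from showing up as a large negative rate in the monitoring data.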

5. Integrating with Monitoring Tools

5.1 Integrating with Prometheus

bash

#!/bin/bash

# Monitoring script that prints metrics in the Prometheus text format

# Print the metrics
function output_prometheus_metrics() {
    # CPU usage
    cpu_usage=$(top -bn1 | grep "%Cpu(s)" | awk '{print 100 - $8}')
    echo "# HELP system_cpu_usage_percent CPU usage percentage"
    echo "# TYPE system_cpu_usage_percent gauge"
    echo "system_cpu_usage_percent $cpu_usage"
    echo ""

    # Memory usage (multiply before dividing to keep two decimals)
    memory_total=$(free -m | grep Mem: | awk '{print $2}')
    memory_used=$(free -m | grep Mem: | awk '{print $3}')
    memory_usage=$(echo "scale=2; $memory_used * 100 / $memory_total" | bc)
    echo "# HELP system_memory_usage_percent Memory usage percentage"
    echo "# TYPE system_memory_usage_percent gauge"
    echo "system_memory_usage_percent $memory_usage"
    echo ""

    # Root filesystem usage
    disk_usage=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
    echo "# HELP system_disk_usage_percent Disk usage percentage"
    echo "# TYPE system_disk_usage_percent gauge"
    echo "system_disk_usage_percent $disk_usage"
}

# Main function
function main() {
    output_prometheus_metrics
}

# Run the main function
main
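
Prometheus pulls metrics over HTTP, so output printed by a script still needs a way to reach it. One common route is the node_exporter textfile collector, which reads `*.prom` files from the directory given by `--collector.textfile.directory`. The sketch below writes such a file atomically so that a scrape never sees a half-written file; the function name, file name, and metric are assumptions for illustration.

```python
import os
import tempfile


def write_prom_file(directory, name, metrics):
    """Atomically write gauges to <directory>/<name>.prom and return the path."""
    lines = []
    for metric, value in metrics.items():
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {value}")
    # Write to a temporary file in the same directory, then rename it into place
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    os.chmod(tmp_path, 0o644)            # mkstemp creates the file as 0600
    final_path = os.path.join(directory, f"{name}.prom")
    os.replace(tmp_path, final_path)     # atomic on the same filesystem
    return final_path


# Demonstration in a throwaway directory; a real setup would use the collector directory
demo_dir = tempfile.mkdtemp()
print(write_prom_file(demo_dir, "custom_monitor", {"system_disk_usage_percent": 40}))
```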

5.2 Integrating with Zabbix

bash

#!/bin/bash

# Monitoring script for Zabbix integration

# Item name, passed as the first argument
METRIC=$1

# CPU usage
if [ "$METRIC" == "cpu_usage" ]; then
    cpu_usage=$(top -bn1 | grep "%Cpu(s)" | awk '{print 100 - $8}')
    echo "$cpu_usage"
# Memory usage
elif [ "$METRIC" == "memory_usage" ]; then
    memory_total=$(free -m | grep Mem: | awk '{print $2}')
    memory_used=$(free -m | grep Mem: | awk '{print $3}')
    memory_usage=$(echo "scale=2; $memory_used * 100 / $memory_total" | bc)
    echo "$memory_usage"
# Root filesystem usage
elif [ "$METRIC" == "disk_usage" ]; then
    disk_usage=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
    echo "$disk_usage"
# Network traffic in bytes on eth0 (adjust the interface name to your system)
elif [ "$METRIC" == "network_rx" ]; then
    network_rx=$(cat /sys/class/net/eth0/statistics/rx_bytes)
    echo "$network_rx"
elif [ "$METRIC" == "network_tx" ]; then
    network_tx=$(cat /sys/class/net/eth0/statistics/tx_bytes)
    echo "$network_tx"
else
    echo "Unknown item: $METRIC"
    exit 1
fi

5.3 Integrating with Nagios/Icinga

bash

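#!/bin/bash

# Monitoring script for Nagios/Icinga integration
# Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN

# Thresholds (percent)
WARNING_THRESHOLD=80
CRITICAL_THRESHOLD=90

# CPU usage
cpu_usage=$(top -bn1 | grep "%Cpu(s)" | awk '{print 100 - $8}')

# Report UNKNOWN if the value could not be read
if [ -z "$cpu_usage" ]; then
    echo "UNKNOWN - could not read CPU usage"
    exit 3
fi
cpu_usage_int=$(echo "$cpu_usage" | awk '{print int($1)}')

# Compare against the thresholds
if [ "$cpu_usage_int" -ge "$CRITICAL_THRESHOLD" ]; then
    echo "CRITICAL - CPU usage: $cpu_usage%"
    exit 2
elif [ "$cpu_usage_int" -ge "$WARNING_THRESHOLD" ]; then
    echo "WARNING - CPU usage: $cpu_usage%"
    exit 1
else
    echo "OK - CPU usage: $cpu_usage%"
    exit 0
fi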
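
Nagios-style plugins communicate through the exit code plus one line of output, which may carry performance data after a `|` in the form `label=value[unit];warn;crit;min;max`. The sketch below shows the same CPU check with performance data added; the function name and thresholds are assumptions for illustration.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_cpu(value, warning=80, critical=90):
    """Map a CPU usage percentage to an (exit_code, output_line) pair."""
    if value is None:
        return UNKNOWN, "UNKNOWN - CPU usage could not be read"
    if value >= critical:
        state = CRITICAL
    elif value >= warning:
        state = WARNING
    else:
        state = OK
    perfdata = f"cpu={value}%;{warning};{critical};0;100"
    return state, f"{STATE_NAMES[state]} - CPU usage: {value}% | {perfdata}"


code, line = evaluate_cpu(85.0)
print(line)
# A real plugin would end with sys.exit(code) so the scheduler sees the state
```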

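6. Deploying and Managing Monitoring Scripts

6.1 Deploying Monitoring Scripts

Local deployment:
- Copy the script to the monitored server
- Make the script executable
- Configure the monitoring system to call the script
Centralized deployment:
- Keep the scripts on the monitoring server
- Run them remotely, for example over SSH
- Collect the results
Containerized deployment:
- Package the monitoring script in a Docker image
- Run the container and configure monitoring for it

6.2 Scheduling Monitoring Scripts

Cron jobs: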

bash

# Edit the current user's crontab
crontab -e

# Run the monitoring script every 5 minutes
*/5 * * * * /path/to/monitor_script.sh >> /var/log/monitor.log 2>&1

Systemd timers:

bash

# Create the systemd service unit
cat > /etc/systemd/system/monitor.service << 'EOF'
[Unit]
Description=System Monitor Service

[Service]
Type=oneshot
ExecStart=/path/to/monitor_script.sh
EOF

# Create the systemd timer unit
cat > /etc/systemd/system/monitor.timer << 'EOF'
[Unit]
Description=Run monitor every 5 minutes

[Timer]
OnBootSec=1min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target
EOF

# Reload systemd, then enable and start the timer
systemctl daemon-reload
systemctl enable monitor.timer
systemctl start monitor.timer

Scheduling by the monitoring system:
- Set the scrape interval in Prometheus
- Set the update interval of each item in Zabbix
- Set the check interval in Nagios
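
With a fixed schedule, a slow run can still be active when the next one starts. One way to skip overlapping runs, sketched here for Unix-like systems with `fcntl.flock`, is to hold an exclusive lock on a file for the duration of the run. The function names and the lock path are assumptions for illustration.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open file holding an exclusive lock, or None if it is already held."""
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def run_once(lock_path, checks):
    """Run checks() unless another run holds the lock; return 'ran' or 'skipped'."""
    lock = try_lock(lock_path)
    if lock is None:
        return "skipped"
    try:
        checks()
    finally:
        lock.close()    # closing the file releases the lock
    return "ran"


demo_lock = os.path.join(tempfile.mkdtemp(), "monitor.lock")
print(run_once(demo_lock, lambda: print("running checks")))
```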

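6.3 Managing Monitoring Scripts

Version control:
- Keep the scripts in a version control system such as Git
- Record the change history of each script
Configuration management:
- Keep script settings in configuration files
- Support separate configurations per environment
Log management:
- Log each script run
- Configure log rotation
Error handling:
- Handle errors that occur while the script runs
- Send an alert when the script itself fails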
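7. Case Study: Comprehensive Server Monitoring Script

7.1 Requirements

Monitor CPU, memory, disk, network, and other system metrics of the server
Monitor the state of key services on the server
Monitor the system load
Send an alert when a metric exceeds its threshold
Support integration with Prometheus

7.2 Script Implementation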

bash

#!/bin/bash

# Comprehensive server monitoring script

# Configuration file
CONFIG_FILE="/etc/monitor.conf"

# Default configuration (single-quoted so the inner double quotes are kept literally)
DEFAULT_CONFIG='
# Alert thresholds
CPU_WARNING=80
CPU_CRITICAL=90
MEMORY_WARNING=85
MEMORY_CRITICAL=95
DISK_WARNING=85
DISK_CRITICAL=95
LOAD_WARNING=2.0
LOAD_CRITICAL=4.0

# Services to monitor
SERVICES="ssh nginx mysql redis"

# Alert settings
ALERT_ENABLED=true
ALERT_EMAIL="admin@example.com"

# Log settings
LOG_FILE="/var/log/server_monitor.log"
LOG_LEVEL="info"
'

# Load the configuration
function load_config() {
    if [ -f "$CONFIG_FILE" ]; then
        source "$CONFIG_FILE"
    else
        # Create the default configuration file (writing under /etc needs root)
        echo "$DEFAULT_CONFIG" > "$CONFIG_FILE"
        source "$CONFIG_FILE"
    fi
}

# Write a log entry
function log() {
    local level=$1
    local message=$2
    local timestamp=$(date '+%Y-%m-%d %H:%M:%S')

    echo "[$timestamp] [$level] $message" >> "$LOG_FILE"

    # Echo to stderr so that log lines never mix with metric output on stdout
    if [ "$LOG_LEVEL" == "debug" ] || [ "$LOG_LEVEL" == "$level" ]; then
        echo "[$timestamp] [$level] $message" >&2
    fi
}

# Send an alert
function send_alert() {
    local level=$1
    local subject=$2
    local message=$3

    if [ "$ALERT_ENABLED" == "true" ]; then
        log "alert" "$subject: $message"

        # Email alert (requires mailx)
        if command -v mailx &> /dev/null; then
            echo "$message" | mailx -s "[$level] $subject" "$ALERT_EMAIL"
        fi
    fi
}

# Monitor CPU
function monitor_cpu() {
    log "info" "Checking CPU"

    cpu_usage=$(top -bn1 | grep "%Cpu(s)" | awk '{print 100 - $8}')
    cpu_usage_int=$(echo "$cpu_usage" | awk '{print int($1)}')

    log "info" "CPU usage: $cpu_usage%"

    if [ "$cpu_usage_int" -ge "$CPU_CRITICAL" ]; then
        send_alert "CRITICAL" "CPU usage alert" "CPU usage: $cpu_usage%, above the critical threshold of $CPU_CRITICAL%"
        return 2
    elif [ "$cpu_usage_int" -ge "$CPU_WARNING" ]; then
        send_alert "WARNING" "CPU usage alert" "CPU usage: $cpu_usage%, above the warning threshold of $CPU_WARNING%"
        return 1
    else
        return 0
    fi
}

# Monitor memory
function monitor_memory() {
    log "info" "Checking memory"

    memory_total=$(free -m | grep Mem: | awk '{print $2}')
    memory_used=$(free -m | grep Mem: | awk '{print $3}')
    # Multiply before dividing so that scale=2 does not truncate the ratio first
    memory_usage=$(echo "scale=2; $memory_used * 100 / $memory_total" | bc)
    memory_usage_int=$(echo "$memory_usage" | awk '{print int($1)}')

    log "info" "Memory usage: $memory_usage%"

    if [ "$memory_usage_int" -ge "$MEMORY_CRITICAL" ]; then
        send_alert "CRITICAL" "Memory usage alert" "Memory usage: $memory_usage%, above the critical threshold of $MEMORY_CRITICAL%"
        return 2
    elif [ "$memory_usage_int" -ge "$MEMORY_WARNING" ]; then
        send_alert "WARNING" "Memory usage alert" "Memory usage: $memory_usage%, above the warning threshold of $MEMORY_WARNING%"
        return 1
    else
        return 0
    fi
}

# Monitor disk
function monitor_disk() {
    log "info" "Checking disk"

    # Usage of the root filesystem
    disk_usage=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')

    log "info" "Disk usage: $disk_usage%"

    if [ "$disk_usage" -ge "$DISK_CRITICAL" ]; then
        send_alert "CRITICAL" "Disk usage alert" "Disk usage: $disk_usage%, above the critical threshold of $DISK_CRITICAL%"
        return 2
    elif [ "$disk_usage" -ge "$DISK_WARNING" ]; then
        send_alert "WARNING" "Disk usage alert" "Disk usage: $disk_usage%, above the warning threshold of $DISK_WARNING%"
        return 1
    else
        return 0
    fi
}

# Monitor load
function monitor_load() {
    log "info" "Checking load"

    # 1-minute load average; /proc/loadavg has a fixed layout, unlike uptime output
    load=$(awk '{print $1}' /proc/loadavg)
    load_float=$(echo "$load" | awk '{print $1 + 0}')

    log "info" "System load: $load"

    if (( $(echo "$load_float >= $LOAD_CRITICAL" | bc -l) )); then
        send_alert "CRITICAL" "System load alert" "System load: $load, above the critical threshold of $LOAD_CRITICAL"
        return 2
    elif (( $(echo "$load_float >= $LOAD_WARNING" | bc -l) )); then
        send_alert "WARNING" "System load alert" "System load: $load, above the warning threshold of $LOAD_WARNING"
        return 1
    else
        return 0
    fi
}

# Monitor services
function monitor_services() {
    log "info" "Checking services"

    local services=($SERVICES)
    local down_services=()

    for service in "${services[@]}"; do
        if command -v systemctl &> /dev/null; then
            systemctl is-active --quiet "$service"
            status=$?
        elif command -v service &> /dev/null; then
            service "$service" status > /dev/null 2>&1
            status=$?
        else
            log "error" "Cannot check service status: neither systemctl nor service is available"
            return 1
        fi

        if [ $status -ne 0 ]; then
            down_services+=("$service")
        else
            log "info" "Service $service is running"
        fi
    done

    if [ ${#down_services[@]} -gt 0 ]; then
        send_alert "CRITICAL" "Service status alert" "The following services are not running: ${down_services[*]}"
        return 2
    else
        return 0
    fi
}

# Print metrics in the Prometheus text format
function output_prometheus_metrics() {
    # CPU usage
    cpu_usage=$(top -bn1 | grep "%Cpu(s)" | awk '{print 100 - $8}')
    echo "# HELP server_cpu_usage_percent CPU usage percentage"
    echo "# TYPE server_cpu_usage_percent gauge"
    echo "server_cpu_usage_percent $cpu_usage"
    echo ""

    # Memory usage
    memory_total=$(free -m | grep Mem: | awk '{print $2}')
    memory_used=$(free -m | grep Mem: | awk '{print $3}')
    memory_usage=$(echo "scale=2; $memory_used * 100 / $memory_total" | bc)
    echo "# HELP server_memory_usage_percent Memory usage percentage"
    echo "# TYPE server_memory_usage_percent gauge"
    echo "server_memory_usage_percent $memory_usage"
    echo ""

    # Root filesystem usage
    disk_usage=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
    echo "# HELP server_disk_usage_percent Disk usage percentage"
    echo "# TYPE server_disk_usage_percent gauge"
    echo "server_disk_usage_percent $disk_usage"
    echo ""

    # 1-minute load average
    load=$(awk '{print $1}' /proc/loadavg)
    echo "# HELP server_system_load System load"
    echo "# TYPE server_system_load gauge"
    echo "server_system_load $load"
    echo ""

    # Service status; HELP and TYPE may appear only once per metric name
    echo "# HELP server_service_status Service status (1=running, 0=stopped)"
    echo "# TYPE server_service_status gauge"
    local services=($SERVICES)
    for service in "${services[@]}"; do
        if command -v systemctl &> /dev/null; then
            systemctl is-active --quiet "$service"
            status=$?
        elif command -v service &> /dev/null; then
            service "$service" status > /dev/null 2>&1
            status=$?
        else
            status=1
        fi

        service_status=$((status == 0 ? 1 : 0))
        echo "server_service_status{service=\"$service\"} $service_status"
    done
}

# Main function
function main() {
    # Load the configuration
    load_config

    log "info" "Server monitoring started"

    # Print Prometheus metrics and stop if requested
    if [ "$1" == "--prometheus" ]; then
        output_prometheus_metrics
        return 0
    fi

    # Run each check
    monitor_cpu
    monitor_memory
    monitor_disk
    monitor_load
    monitor_services

    log "info" "Server monitoring finished"
}

# Run the main function
main "$@"

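8. Best Practices for Monitoring Scripts

8.1 Coding Conventions

Naming:
- Use lowercase letters and underscores for script names
- Use lowercase letters and underscores for function names
- Use uppercase letters and underscores for constants and configuration variables; function-local variables may stay lowercase, as in the scripts above
Code structure:
- Design the script in modules
- Separate monitoring logic from alerting logic
- Wrap repeated code in functions
Error handling:
- Handle errors so that the script does not crash
- Log errors
- Send an alert when the script itself fails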

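8.2 Performance

Execution efficiency:
- Avoid overly complex command pipelines
- Run each command as few times as possible (for example, call free once and reuse its output)
- Cache values instead of recomputing them
Resource usage:
- Keep memory usage low
- Keep CPU usage low
- Avoid generating unnecessary network traffic
Execution time:
- Keep the run time of the script bounded
- Set reasonable timeouts
- Avoid long-running scripts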
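
A timeout is easiest to enforce at the point where a check calls an external command. The sketch below wraps `subprocess.run` with a timeout so that one hung command cannot block the whole run; the `run_check` name and its return convention are assumptions for illustration.

```python
import subprocess
import sys


def run_check(command, timeout_seconds):
    """Run a check command; return (exit_code, stdout), or (None, '') on timeout."""
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception
        return None, ""
    return result.returncode, result.stdout.strip()


# The Python interpreter stands in for a real check command here
print(run_check([sys.executable, "-c", "print('ok')"], 30))
```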

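8.3 Security

Access control:
- Set appropriate permissions on the script files
- Avoid running scripts as root unless a check requires it
Input validation:
- Validate every input parameter
- Guard against command injection
Information security:
- Do not hard-code sensitive information in scripts
- Do not write sensitive information to logs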
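
Input validation and injection safety can be combined in a few lines: check the parameter against an allow-list pattern, then pass it to the command as a separate argument instead of building a shell string. The pattern and function names below are assumptions for illustration and may need adjusting to the unit names in use.

```python
import re

# Must start with a letter or digit, so a value can never be read as an option
SERVICE_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9@_.:-]*")


def is_valid_service_name(name):
    """Accept only names made of characters that are safe in a unit name."""
    return bool(SERVICE_NAME.fullmatch(name))


def build_status_command(name):
    """Return an argument list for subprocess.run; no shell is involved."""
    if not is_valid_service_name(name):
        raise ValueError(f"invalid service name: {name!r}")
    return ["systemctl", "is-active", "--quiet", name]


print(build_status_command("nginx"))
print(is_valid_service_name("nginx; rm -rf /"))
```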

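8.4 Maintainability

Documentation:
- Comment the script thoroughly
- Provide usage documentation
Configuration management:
- Keep script settings in configuration files
- Support separate configurations per environment
Version control:
- Keep the scripts in a version control system such as Git
- Record the change history of each script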

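9. Course Summary

9.1 Key Points

Why custom script monitoring matters: its advantages and use cases
Monitoring script development: writing the different types of monitoring scripts
Monitoring tool integration: connecting scripts to common monitoring systems
Deployment and management: deploying, scheduling, and managing monitoring scripts
Best practices: following the conventions for monitoring script development

9.2 Practical Advice

Start simple: write simple monitoring scripts first and add complexity gradually
Test thoroughly: verify the reliability of each script in different environments
Keep optimizing: refine the scripts based on how they behave in real use
Integrate with a monitoring system: feed the scripts into a monitoring system to automate monitoring
Maintain regularly: review and update the scripts periodically so that they stay effective

9.3 Further Study

Distributed monitoring: design and implementation of distributed monitoring systems
Intelligent monitoring: using AI techniques for anomaly detection and prediction
Monitoring data visualization: visualizing monitoring data with tools such as Grafana
Alerting integration: integrating with different alerting systems
Monitoring system design: designing a complete monitoring architecture

After this course you should be able to develop custom monitoring scripts for your own requirements and use them to monitor systems, applications, and services effectively.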
