Official API documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/cloudwatch.html#CloudWatch.Client.put_metric_alarm

1. Configure the instances

1.1 Launch the EC2 instances

Note: the EC2 instances have a corresponding IAM role attached. For the memory and disk monitoring script, see: http://hub.bosicloud.com/pages/viewpage.action?pageId=12356151

1.2 Test the command
[k8s-prod-sg@ip-172-20-21-242 tmp]$ aws ec2 describe-instances --filters Name=tag:cloudwatch,Values=agent Name=tag:sys,Values=linux --query "Reservations[*].Instances[*].InstanceId"
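
The same lookup can also be done from Python instead of shelling out to the CLI. Below is a minimal sketch, assuming an EC2 client created with boto3.client('ec2') is passed in. The names build_tag_filters and list_tagged_instance_ids are illustrative and not part of the original script.

```python
def build_tag_filters(tags):
    """Turn {'cloudwatch': 'agent'} into the Filters list used by describe_instances."""
    return [{"Name": "tag:" + key, "Values": [value]} for key, value in tags.items()]


def list_tagged_instance_ids(ec2_client, tags):
    """Return the IDs of all instances that carry every tag in `tags`.

    `ec2_client` is assumed to be a boto3 EC2 client (boto3.client('ec2')).
    The paginator walks every result page, so large fleets are covered too.
    """
    paginator = ec2_client.get_paginator("describe_instances")
    instance_ids = []
    for page in paginator.paginate(Filters=build_tag_filters(tags)):
        for reservation in page.get("Reservations", []):
            for instance in reservation.get("Instances", []):
                instance_ids.append(instance["InstanceId"])
    return instance_ids


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   ids = list_tagged_instance_ids(boto3.client("ec2"),
#                                  {"cloudwatch": "agent", "sys": "linux"})
```

Passing the client in as an argument keeps the helper easy to exercise with a stub client, without touching AWS.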

2. Configure SNS notifications

This was already configured beforehand, so it is not covered here. Configuration reference: https://aws.amazon.com/cn/blogs/china/enterprise-wechat-and-dingtalk-receiving-amazon-cloudwatch-alarms/

3. Script

3.1 Define the EC2 tags

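Two tags must be defined on each EC2 instance: cloudwatch:agent and sys:linux. Together they mark a Linux instance as enrolled in alerting, and the script later uses both tags to look up the matching instance IDs.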
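
The tags can be applied in the console, or programmatically. The following is a hedged sketch of the latter, assuming a boto3 EC2 client is passed in. The helper names to_tag_list and tag_for_alerting are hypothetical, and the instance IDs in the usage comment are placeholders.

```python
# The two tags the alerting script filters on.
ALERT_TAGS = {"cloudwatch": "agent", "sys": "linux"}


def to_tag_list(tags):
    """Convert a plain dict into the Key/Value list expected by create_tags."""
    return [{"Key": key, "Value": value} for key, value in tags.items()]


def tag_for_alerting(ec2_client, instance_ids, tags=None):
    """Attach the alerting tags to the given instances.

    `ec2_client` is assumed to be boto3.client('ec2').
    Returns the create_tags response, or None when there is nothing to tag.
    """
    instance_ids = list(instance_ids)
    if not instance_ids:
        return None
    return ec2_client.create_tags(
        Resources=instance_ids,
        Tags=to_tag_list(ALERT_TAGS if tags is None else tags),
    )


# Usage (requires boto3 and AWS credentials; the ID below is a placeholder):
#   import boto3
#   tag_for_alerting(boto3.client("ec2"), ["i-0123456789abcdef0"])
```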

3.2 The batch management script
[k8s-prod-sg@ip-172-20-21-242 tmp]$ vi ec2-alert.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Yusin
# 2021.6.26
# v1.0.1
# Batch-add EC2 alarms; this script only supports alarms for Linux instances

import subprocess,re
import boto3

client = boto3.client('cloudwatch')
ec2_cmd = '/usr/local/bin/aws ec2 describe-instances --filters Name=tag:cloudwatch,Values=agent Name=tag:sys,Values=linux --query "Reservations[*].Instances[*].InstanceId"'

# CPUUtilization
def get_ec2_utilization(instanceid):
    dic1 = {}
    # Build the alarm parameters
    dic1['Namespace'] = "AWS/EC2"
    dic1['MetricName'] = "CPUUtilization"
    dic1['Dimensions'] = [
        {
            "Name": "InstanceId",
            "Value": instanceid
        }
    ]
    dic1['AlarmActions'] = [
        "arn:aws:sns:ap-southeast-1:553972031073:Default_CloudWatch_Alarms_Topic"
    ]
    dic1['ComparisonOperator'] = "GreaterThanThreshold"
    dic1['DatapointsToAlarm'] = 3
    dic1['EvaluationPeriods'] = 4
    dic1['Period'] = 60
    dic1['Statistic'] = "Average"
    dic1['Threshold'] = 90
    dic1['AlarmDescription'] = "alert:CPU Utilization of {id} with 90% as threshold".format(id=instanceid)
    dic1['AlarmName'] = "Instance %s CPU Utilization" % (instanceid)
    response = client.put_metric_alarm(**dic1)
    return response

#disk_used
def get_ec2_device(instanceid):
    dic1 = {}
    # Build the alarm parameters
    dic1['Namespace'] = "CWAgent"
    dic1['MetricName'] = "disk_used_percent"
    dic1['Dimensions'] = [
        {
            "Name": "InstanceId",
            "Value": instanceid
        },
        {
            "Name": "path",
            "Value": "/",
        },
        {
            "Name": "device",
            "Value": "xvda1",
        },
        {
            "Name": "fstype",
            "Value": "xfs",
        },
    ]
    dic1['AlarmActions'] = [
        "arn:aws:sns:ap-southeast-1:553972031073:Default_CloudWatch_Alarms_Topic"
    ]
    dic1['ComparisonOperator'] = "GreaterThanThreshold"
    dic1['DatapointsToAlarm'] = 3
    dic1['EvaluationPeriods'] = 4
    dic1['Period'] = 60
    dic1['Statistic'] = "Average"
    dic1['Threshold'] = 90
    dic1['AlarmDescription'] = "alert:disk_used_percent of {id} with 90% as threshold".format(id=instanceid)
    dic1['AlarmName'] = "Instance %s disk_used_percent" % (instanceid)
    response = client.put_metric_alarm(**dic1)
    return response

# mem_used
def get_ec2_mem(instanceid):
    dic1 = {}
    # Build the alarm parameters
    dic1['Namespace'] = "CWAgent"
    dic1['MetricName'] = "mem_used_percent"
    dic1['Dimensions'] = [
        {
            "Name": "InstanceId",
            "Value": instanceid
        }
    ]
    dic1['AlarmActions'] = [
        "arn:aws:sns:ap-southeast-1:553972031073:Default_CloudWatch_Alarms_Topic"
    ]
    dic1['ComparisonOperator'] = "GreaterThanThreshold"
    dic1['DatapointsToAlarm'] = 3
    dic1['EvaluationPeriods'] = 4
    dic1['Period'] = 60
    dic1['Statistic'] = "Average"
    dic1['Threshold'] = 90
    dic1['AlarmDescription'] = "alert:mem_used of {id} with 90% as threshold".format(id=instanceid)
    dic1['AlarmName'] = "Instance %s mem_used" % (instanceid)
    response = client.put_metric_alarm(**dic1)
    return response

# Collect the EC2 instance IDs; return 'error_id' if the CLI command fails
def get_ec2_instanceid(cmd):
    pattern = re.compile(r'i-\w+')
    ret = subprocess.run(cmd,shell=True,stdout=subprocess.PIPE,stderr=subprocess.PIPE,encoding="utf-8",timeout=10)
    # print('code:',ret.returncode)
    if ret.returncode == 0:
        ret = pattern.findall(ret.stdout)
        return ret
    else:
        return 'error_id'

if __name__ == '__main__':
    instanceid_list = get_ec2_instanceid(ec2_cmd)
    print('instanceid_list:', instanceid_list)
    if instanceid_list != "error_id":
        # instance_id avoids shadowing the built-in id()
        for instance_id in instanceid_list:
            # CPU utilization alarm
            ret = get_ec2_utilization(instance_id)
            if ret['ResponseMetadata']['HTTPStatusCode'] == 200:
                print('utilization alarm added successfully')
            else:
                print('failed to add utilization alarm')

            # Disk usage alarm
            ret = get_ec2_device(instance_id)
            if ret['ResponseMetadata']['HTTPStatusCode'] == 200:
                print('device alarm added successfully')
            else:
                print('failed to add device alarm')

            # Memory usage alarm
            ret = get_ec2_mem(instance_id)
            if ret['ResponseMetadata']['HTTPStatusCode'] == 200:
                print('mem alarm added successfully')
            else:
                print('failed to add mem alarm')
[k8s-prod-de@ip-172-20-21-242 ~]$ pip3 install boto3
[k8s-prod-sg@ip-172-20-21-242 tmp]$ python3 ec2-alert.py
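
The disk alarm in the script hard-codes device xvda1 and fstype xfs, so it only matches instances whose root volume reports exactly those dimensions. One possible way around this, sketched here under the assumption that the CloudWatch agent has already published disk_used_percent for the instance, is to ask CloudWatch which dimension sets exist. The function name find_disk_dimensions is illustrative.

```python
def find_disk_dimensions(cloudwatch_client, instance_id, path="/"):
    """Return the dimension sets recorded for disk_used_percent on one mount path.

    `cloudwatch_client` is assumed to be boto3.client('cloudwatch').
    Each returned item is a list of {"Name": ..., "Value": ...} dicts and can
    be passed directly as the Dimensions argument of put_metric_alarm.
    """
    paginator = cloudwatch_client.get_paginator("list_metrics")
    pages = paginator.paginate(
        Namespace="CWAgent",
        MetricName="disk_used_percent",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
    )
    matches = []
    for page in pages:
        for metric in page.get("Metrics", []):
            # Flatten the dimension list so the mount path is easy to check.
            values = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
            if values.get("path") == path:
                matches.append(metric["Dimensions"])
    return matches


# Usage (requires boto3 and AWS credentials; the ID below is a placeholder):
#   import boto3
#   dims = find_disk_dimensions(boto3.client("cloudwatch"), "i-0123456789abcdef0")
```

An empty result suggests the agent has not reported disk metrics for that instance yet, in which case creating the disk alarm would leave it without data.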

4. Result
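
One way to confirm the outcome without opening the console is to list the alarms whose names start with the "Instance <id> " prefix that the script uses. This is a minimal sketch, assuming a boto3 CloudWatch client is passed in. The function name alarm_states is hypothetical.

```python
def alarm_states(cloudwatch_client, instance_id):
    """Map each alarm name created for an instance to its current state.

    `cloudwatch_client` is assumed to be boto3.client('cloudwatch').
    The prefix mirrors the AlarmName pattern "Instance <id> ..." used by the script.
    """
    paginator = cloudwatch_client.get_paginator("describe_alarms")
    prefix = "Instance {} ".format(instance_id)
    states = {}
    for page in paginator.paginate(AlarmNamePrefix=prefix):
        for alarm in page.get("MetricAlarms", []):
            states[alarm["AlarmName"]] = alarm["StateValue"]
    return states


# Usage (requires boto3 and AWS credentials; the ID below is a placeholder):
#   import boto3
#   print(alarm_states(boto3.client("cloudwatch"), "i-0123456789abcdef0"))
```

If the script ran cleanly, three alarm names per instance would be expected in the returned mapping.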

5. Summary

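  • At first I tried to borrow a ready-made guide from the web to handle batch alarm configuration, but after working through it the setup failed, and I could not work out why. Reference: https://aws.amazon.com/cn/blogs/china/add-alarms-in-batches-for-ec2-on-cloudwatch/
  • Basic resource alarms cover only CPU, memory, and disk; everything else is handled with Prometheus. The script is still fairly raw and has not been optimized.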
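
As one possible cleanup, the three alarm functions differ only in namespace, metric name, label, and dimensions, so they could share a single builder. The sketch below is not the author's implementation: build_alarm_params, ALARM_SPECS, and create_alarms are illustrative names, and the SNS topic ARN, thresholds, and disk dimensions are copied from the script above.

```python
SNS_TOPIC_ARN = "arn:aws:sns:ap-southeast-1:553972031073:Default_CloudWatch_Alarms_Topic"

# (namespace, metric name, label used in the alarm name, extra dimensions)
ALARM_SPECS = [
    ("AWS/EC2", "CPUUtilization", "CPU Utilization", None),
    ("CWAgent", "disk_used_percent", "disk_used_percent",
     {"path": "/", "device": "xvda1", "fstype": "xfs"}),
    ("CWAgent", "mem_used_percent", "mem_used", None),
]


def build_alarm_params(instance_id, namespace, metric_name, label,
                       threshold=90, extra_dimensions=None):
    """Build the keyword arguments for put_metric_alarm for one metric."""
    dimensions = [{"Name": "InstanceId", "Value": instance_id}]
    for name, value in (extra_dimensions or {}).items():
        dimensions.append({"Name": name, "Value": value})
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "AlarmActions": [SNS_TOPIC_ARN],
        "ComparisonOperator": "GreaterThanThreshold",
        "DatapointsToAlarm": 3,
        "EvaluationPeriods": 4,
        "Period": 60,
        "Statistic": "Average",
        "Threshold": threshold,
        "AlarmDescription": "alert:{} of {} with {}% as threshold".format(
            label, instance_id, threshold),
        "AlarmName": "Instance {} {}".format(instance_id, label),
    }


def create_alarms(cloudwatch_client, instance_id):
    """Create all alarms for one instance and return the alarm names.

    `cloudwatch_client` is assumed to be boto3.client('cloudwatch').
    """
    names = []
    for namespace, metric_name, label, extra in ALARM_SPECS:
        params = build_alarm_params(instance_id, namespace, metric_name, label,
                                    extra_dimensions=extra)
        cloudwatch_client.put_metric_alarm(**params)
        names.append(params["AlarmName"])
    return names
```

Because the threshold appears in one place, the description text and the actual Threshold value cannot drift apart.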
Last modified: December 16, 2023
