Official API documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/cloudwatch.html#CloudWatch.Client.put_metric_alarm
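1. Configure the instances
1.1 Launch the EC2 instances
# Note: the EC2 instances have a corresponding IAM role attached. For the memory and disk monitoring script, see: http://hub.bosicloud.com/pages/viewpage.action?pageId=12356151
1.2 Test the command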
[k8s-prod-sg@ip-172-20-21-242 tmp]$ aws ec2 describe-instances --filters Name=tag:cloudwatch,Values=agent Name=tag:sys,Values=linux --query "Reservations[*].Instances[*].InstanceId"
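The `--query` expression above returns a nested JSON array with one inner list per reservation, assuming the CLI output format is left at its default of `json`. A minimal sketch of flattening that output in Python follows. The function name and the sample instance IDs are made up for illustration and are not taken from a real account.

```python
import json


def flatten_instance_ids(cli_output):
    """Flatten the nested list printed by the describe-instances query."""
    nested = json.loads(cli_output)  # shape: [["i-..."], ["i-...", "i-..."]]
    return [instance_id for reservation in nested for instance_id in reservation]


# Hypothetical CLI output; real instance IDs will differ.
sample_output = '[["i-0123456789abcdef0"], ["i-0fedcba9876543210", "i-0a1b2c3d4e5f67890"]]'
print(flatten_instance_ids(sample_output))
```

Parsing the JSON keeps the result tied to the query structure, whereas a pattern match on the raw text would also pick up any other string that happens to look like an instance ID.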
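2. Configure SNS notifications
SNS was configured beforehand, so it is not covered here. Configuration reference: https://aws.amazon.com/cn/blogs/china/enterprise-wechat-and-dingtalk-receiving-amazon-cloudwatch-alarms/
3. Script
3.1 Define the EC2 tags
Two tags must be set on each EC2 instance: cloudwatch:agent and sys:linux. Together they mark a Linux instance as enrolled in alerting, and the script later uses both tags to look up the matching instance IDs.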
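The same tag lookup can also be done directly through boto3 instead of shelling out to the AWS CLI. The sketch below is one possible way to do it, not the approach the script in this post takes. The helper names (`build_tag_filters`, `extract_instance_ids`, `list_tagged_instance_ids`) are hypothetical; the boto3 calls used are the EC2 client's `get_paginator("describe_instances")` and its `Filters` parameter.

```python
def build_tag_filters(tags):
    """Turn {"cloudwatch": "agent"} into describe-instances tag filters."""
    return [{"Name": f"tag:{key}", "Values": [value]} for key, value in tags.items()]


def extract_instance_ids(pages):
    """Collect InstanceId values from describe_instances response pages."""
    return [
        instance["InstanceId"]
        for page in pages
        for reservation in page.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


def list_tagged_instance_ids(tags, ec2_client=None):
    """Return the IDs of instances that carry every tag in `tags`."""
    if ec2_client is None:
        import boto3  # imported lazily so the helpers above load without boto3
        ec2_client = boto3.client("ec2")
    paginator = ec2_client.get_paginator("describe_instances")
    pages = paginator.paginate(Filters=build_tag_filters(tags))
    return extract_instance_ids(pages)


# Usage (needs AWS credentials and a default region, so it is not run here):
# ids = list_tagged_instance_ids({"cloudwatch": "agent", "sys": "linux"})
```

Multiple filters are combined with AND, so only instances that carry both tags are returned. The paginator also handles accounts with more instances than fit in one response.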
3.2 Batch management script
[k8s-prod-sg@ip-172-20-21-242 tmp]$ vi ec2-alert.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Yusin
# 2021.6.26
# v1.0.1
# Batch-add EC2 alarms. This script supports Linux instances only.
import re
import subprocess

import boto3

client = boto3.client('cloudwatch')
ec2_cmd = '/usr/local/bin/aws ec2 describe-instances --filters Name=tag:cloudwatch,Values=agent Name=tag:sys,Values=linux --query "Reservations[*].Instances[*].InstanceId"'

# CPUUtilization
def get_ec2_utilization(instanceid):
    dic1 = {}
    # Build the alarm parameters
    dic1['Namespace'] = "AWS/EC2"
    dic1['MetricName'] = "CPUUtilization"
    dic1['Dimensions'] = [
        {
            "Name": "InstanceId",
            "Value": instanceid
        }
    ]
    dic1['AlarmActions'] = [
        "arn:aws:sns:ap-southeast-1:553972031073:Default_CloudWatch_Alarms_Topic"
    ]
    dic1['ComparisonOperator'] = "GreaterThanThreshold"
    # Alarm when 3 of the last 4 one-minute datapoints exceed the threshold
    dic1['DatapointsToAlarm'] = 3
    dic1['EvaluationPeriods'] = 4
    dic1['Period'] = 60
    dic1['Statistic'] = "Average"
    dic1['Threshold'] = 90
    dic1['AlarmDescription'] = "alert:CPU Utilization of {id} with 90% as threshold".format(id=instanceid)
    dic1['AlarmName'] = "Instance %s CPU Utilization" % (instanceid)
    response = client.put_metric_alarm(**dic1)
    return response

# disk_used
def get_ec2_device(instanceid):
    dic1 = {}
    # Build the alarm parameters
    dic1['Namespace'] = "CWAgent"
    dic1['MetricName'] = "disk_used_percent"
    dic1['Dimensions'] = [
        {
            "Name": "InstanceId",
            "Value": instanceid
        },
        {
            "Name": "path",
            "Value": "/",
        },
        {
            "Name": "device",
            "Value": "xvda1",
        },
        {
            "Name": "fstype",
            "Value": "xfs",
        },
    ]
    dic1['AlarmActions'] = [
        "arn:aws:sns:ap-southeast-1:553972031073:Default_CloudWatch_Alarms_Topic"
    ]
    dic1['ComparisonOperator'] = "GreaterThanThreshold"
    dic1['DatapointsToAlarm'] = 3
    dic1['EvaluationPeriods'] = 4
    dic1['Period'] = 60
    dic1['Statistic'] = "Average"
    dic1['Threshold'] = 90
    dic1['AlarmDescription'] = "alert:disk_used_percent of {id} with 90% as threshold".format(id=instanceid)
    dic1['AlarmName'] = "Instance %s disk_used_percent" % (instanceid)
    response = client.put_metric_alarm(**dic1)
    return response

# mem_used
def get_ec2_mem(instanceid):
    dic1 = {}
    # Build the alarm parameters
    dic1['Namespace'] = "CWAgent"
    dic1['MetricName'] = "mem_used_percent"
    dic1['Dimensions'] = [
        {
            "Name": "InstanceId",
            "Value": instanceid
        }
    ]
    dic1['AlarmActions'] = [
        "arn:aws:sns:ap-southeast-1:553972031073:Default_CloudWatch_Alarms_Topic"
    ]
    dic1['ComparisonOperator'] = "GreaterThanThreshold"
    dic1['DatapointsToAlarm'] = 3
    dic1['EvaluationPeriods'] = 4
    dic1['Period'] = 60
    dic1['Statistic'] = "Average"
    dic1['Threshold'] = 90
    dic1['AlarmDescription'] = "alert:mem_used of {id} with 90% as threshold".format(id=instanceid)
    dic1['AlarmName'] = "Instance %s mem_used" % (instanceid)
    response = client.put_metric_alarm(**dic1)
    return response

# Collect the EC2 instance IDs; return 'error_id' if the command fails
def get_ec2_instanceid(cmd):
    pattern = re.compile(r'i-\w+')
    ret = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, encoding="utf-8", timeout=10)
    # print('code:', ret.returncode)
    if ret.returncode == 0:
        return pattern.findall(ret.stdout)
    else:
        return 'error_id'

if __name__ == '__main__':
    instanceid_list = get_ec2_instanceid(ec2_cmd)
    # print(instanceid_list)
    print('instanceid_list:', instanceid_list)
    if instanceid_list != "error_id":
        for instance_id in instanceid_list:
            # CPUUtilization alarm
            ret = get_ec2_utilization(instance_id)
            if ret['ResponseMetadata']['HTTPStatusCode'] == 200:
                print('utilization alarm added successfully')
            else:
                print('failed to add utilization alarm')
            # disk_used_percent alarm
            ret = get_ec2_device(instance_id)
            if ret['ResponseMetadata']['HTTPStatusCode'] == 200:
                print('device alarm added successfully')
            else:
                print('failed to add device alarm')
            # mem_used_percent alarm
            ret = get_ec2_mem(instance_id)
            if ret['ResponseMetadata']['HTTPStatusCode'] == 200:
                print('mem alarm added successfully')
            else:
                print('failed to add mem alarm')
[k8s-prod-de@ip-172-20-21-242 ~]$ pip3 install boto3
[k8s-prod-sg@ip-172-20-21-242 tmp]$ python3 ec2-alert.py
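One caveat about the success check in the script: when an API call is rejected, boto3 normally raises an exception (`botocore.exceptions.ClientError`) rather than returning a response with a non-200 status, so the failure branches are unlikely to run and an error would stop the loop instead. A minimal sketch of catching the exception follows. `put_alarm_safely` and `AWS_ERRORS` are hypothetical names and not part of the original script.

```python
try:
    from botocore.exceptions import BotoCoreError, ClientError
    AWS_ERRORS = (BotoCoreError, ClientError)
except ImportError:  # lets the sketch load on a machine without boto3 installed
    AWS_ERRORS = (Exception,)


def put_alarm_safely(client, params):
    """Call put_metric_alarm and return (ok, error_message) instead of raising."""
    try:
        client.put_metric_alarm(**params)
    except AWS_ERRORS as exc:
        return False, str(exc)
    return True, ""


# Usage with the script's CloudWatch client (not run here):
# ok, error = put_alarm_safely(client, dic1)
# print('alarm added' if ok else 'alarm failed: ' + error)
```

Returning a flag per alarm lets the loop continue with the remaining instances when one call fails.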
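4. Result
5. Summary
- At first I hoped to reuse a ready-made guide from the web to batch-configure the alarms, but following it failed and I never worked out why. The guide: https://aws.amazon.com/cn/blogs/china/add-alarms-in-batches-for-ec2-on-cloudwatch/
- Basic resource alerting covers only CPU, memory, and disk; everything else is handled with Prometheus. The script is still fairly crude and has not been optimized.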
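As one possible optimization, the three alarm functions differ only in namespace, metric name, and dimensions, so they could be driven from a small table. This is a sketch under that assumption, not the author's implementation. `ALARM_SPECS` and `build_alarm_params` are hypothetical names; the SNS topic ARN, thresholds, and dimensions are copied from the script.

```python
SNS_TOPIC_ARN = "arn:aws:sns:ap-southeast-1:553972031073:Default_CloudWatch_Alarms_Topic"

# One entry per alarm: (namespace, metric name, dimensions besides InstanceId)
ALARM_SPECS = [
    ("AWS/EC2", "CPUUtilization", []),
    ("CWAgent", "disk_used_percent", [
        {"Name": "path", "Value": "/"},
        {"Name": "device", "Value": "xvda1"},
        {"Name": "fstype", "Value": "xfs"},
    ]),
    ("CWAgent", "mem_used_percent", []),
]


def build_alarm_params(instance_id, namespace, metric_name, extra_dimensions, threshold=90):
    """Build the keyword arguments for CloudWatch put_metric_alarm."""
    dimensions = [{"Name": "InstanceId", "Value": instance_id}] + list(extra_dimensions)
    return {
        "AlarmName": f"Instance {instance_id} {metric_name}",
        "AlarmDescription": f"alert:{metric_name} of {instance_id} with {threshold}% as threshold",
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "AlarmActions": [SNS_TOPIC_ARN],
        "ComparisonOperator": "GreaterThanThreshold",
        "DatapointsToAlarm": 3,
        "EvaluationPeriods": 4,
        "Period": 60,
        "Statistic": "Average",
        "Threshold": threshold,
    }


# Usage with the script's CloudWatch client (not run here):
# for spec in ALARM_SPECS:
#     client.put_metric_alarm(**build_alarm_params(instance_id, *spec))
```

The alarm names here follow the metric name, which differs slightly from the names in the script. Since `put_metric_alarm` creates or updates an alarm by name, switching to this naming would create new alarms alongside the existing ones rather than updating them.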