一、背景
- 本地开发测试环境使用vm虚拟化搭建,同时运行这k8s集群,当前存在着一个问题则是如何监控其他虚拟主机(非k8s集群node节点)。为了降低监控成本这里同样采用prometheus进行监控。参考文档:https://github.com/prometheus/node_exporter
-
官方就node expoter自动发现监控上没有做过多的介绍(事实上本地自建的虚拟机做自动发现功能存在一定的难度。对于云计算平台自动发现功能监控参考文档:https://prometheus.io/docs/prometheus/latest/configuration/configuration/)
- 这里通过scrape_configs定义consul_sd_config对本地虚拟主机进行自动发现监控。参考文档:https://prometheus.io/docs/prometheus/latest/configuration/configuration/
- 架构说明,通过将本地每台的虚拟主机的注册信息注册至consul,prometheus+consul部署自动发现虚拟主机。而后运行周期性注销连接失败的服务(主机下线)。
二、安装consul
[k8s-dev-test@rancher-k8s-conn consul]$ helm upgrade -i node-expoter-consul -n kube-ops .
三、关联promtheus
参考:prometheus配置additional 参考文档见:https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/additional-scrape-config.md
[k8s-dev-test@rancher-k8s-conn tmp]$ vi prometheus-additional.yaml
- job_name: 'node-expoter-consul-prometheus'
consul_sd_configs:
- server: 'node-expoter-consul.kube-ops:8500'
services: ['server']
#备注:这里我仅定义了server这个组,如果有多个需求可以定义多个不同分类。例如webserver、dbserver
四、注册
参考脚本:https://github.com/xiangys0134/deploy/blob/master/%E5%BE%AE%E6%9C%8D%E5%8A%A1/prometheus/register_consul_node.sh
#!/bin/bash
# linux初始化nodeport
# Author yousong.xiang
# Date 2021.9.30
# v1.0.1
node_exporter_version='1.1.2'
# download_path="https://github.com/prometheus/node_exporter/releases/download/v{node_exporter_version}/node_exporter-{node_exporter_version}.linux-amd64.tar.gz"
download_path="https://shelllinux.oss-cn-shanghai.aliyuncs.com/deploy/source/node_exporter-{node_exporter_version}.linux-amd64.tar.gz"
node_exporter_conf='/usr/lib/systemd/system/node_exporter.service'
function env_check {
uid=(id -u)
if [ {uid} -ne 0 ]; then
echo '==此脚本需要root用户执行,程序即将退出.'
exit 2
fi
ping -c 1 -W 2 www.aliyun.com >/dev/null 2>&1
if [? -ne 0 ]; then
echo '==网络不通,请检查网络'
exit 6
fi
}
function check_rpm {
rpm_name=1
num=`rpm -qa|grep{rpm_name}|wc -l`
echo {num}
}
function register_conf {
if [ ! -f{node_exporter_conf} ]; then
cat>>{node_exporter_conf}<<EOF
[Unit]
Description=node_exporter
[Service]
Type=simple
Restart=on-failure
ExecStart=/usr/local/node_exporter/node_exporter --web.listen-address=0.0.0.0:9101
[Install]
WantedBy=multi-user.target
EOF
fi
systemctl daemon-reload
systemctl start node_exporter.service
systemctl enable node_exporter.service
}
function register_consul {
service_name=1
instance_id=2
ip=3
port=4
curl -X PUT -d '{"id":"'"instance_id"'",
"name":"'"service_name"'",
"address":"'"ip"'",
"port":'"port"',
"tags":["'"service_name"'"],
"checks":[{"http": "http://'"ip"':'"port"'","interval":"10s"}]}' \
http://192.168.7.45:30501/v1/agent/service/register
}
function main {
env_check
if [ `check_rpm wget` -eq 0 ]; then
yum install -y wget
fi
echo '正在下载包...'
wget {download_path}
if [? -ne 0 ]; then
echo '包下载失败,请检查网络!'
exit 3
fi
tar -zxvf node_exporter-{node_exporter_version}.linux-amd64.tar.gz
mv node_exporter-{node_exporter_version}.linux-amd64 /usr/local/node_exporter
register_conf
rm -rf node_exporter-{node_exporter_version}.linux-amd64.tar.gz
}
if [# -eq 0 ]; then
main
else
main
#传参示例:bash -x a.sh server rancher-k8s-conn 192.168.7.58 9101
register_consul $@
fi
五、注销服务
consul注销服务参考:https://www.consul.io/api-docs/catalog
周期性的注销无效服务(可以将脚本封装至docker镜像中,通过cron周期性的运行任务),参考脚本如下:
#!/bin/bash
# consul删除安装
# Author yousong.xiang
# Date 2021.9.30
# v1.0.1
#k8s service负载地址
consul_address='node-expoter-consul.kube-ops:8500'
[ -f /etc/profile ] && . /etc/profile
yum install -y epel-release jq
yum install -y jq
function consul_del {
CONSUL_NODES=(curl -s -XGET http://{consul_address}/v1/catalog/nodes | jq -r '.[].Address')
CONSUL_CRITICAL=(curl -s -XGET http://{consul_address}/v1/health/state/critical | jq -r '.[].ServiceID')
for critical in {CONSUL_CRITICAL}
do
for consul_ip in{CONSUL_NODES}
do
curl -s -XPUT http://{consul_ip}:8500/v1/agent/service/deregister/{critical} &> /dev/null
echo "${critical} 已删除" >> consul_del-`date +%Y%m%d`.log
done
done
}
consul_del
留言