How to Monitor VMware ESXi with Grafana and Telegraf

Date: 2020-02-23 14:38:56  Source: igfitidea

See also: How to monitor VMware ESXi hosts with LibreNMS.

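This setup uses the official vSphere plugin for Telegraf to pull metrics from vCenter.
The metrics cover vSphere host compute (RAM and CPU), networking, datastores, and the virtual machines running on the vSphere hypervisors.
Let's get started.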

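Step 1: Install InfluxDB and Grafana

All collected metrics are stored in an InfluxDB database.
Grafana connects to InfluxDB to query the metrics and display them on its dashboards.
InfluxDB and Grafana must be installed before anything else.

How to install InfluxDB on Ubuntu, Debian, and CentOS

How to install Grafana on Ubuntu and CentOS

Once both InfluxDB and Grafana are installed, go on to install and configure Telegraf, a powerful metrics collector.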

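Step 2: Install and configure Telegraf

If you used the link in Step 1 to install InfluxDB, the repository required for the Telegraf installation has already been added.
Install Telegraf with the following commands.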

# CentOS
sudo yum -y install telegraf
# Ubuntu
sudo apt-get -y install telegraf

After installation, configure Telegraf to pull monitoring metrics from vCenter.
Edit the main Telegraf configuration file:

sudo vim /etc/telegraf/telegraf.conf

1. Add the InfluxDB output, the storage backend where the metrics will be stored.

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
    urls = ["http://10.10.1.20:8086"]
    database = "vmware"
    timeout = "0s"
    username = "monitoring"
    password = "DBPassword"

Replace 10.10.1.20 with the IP address of your InfluxDB server.
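The output section above refers to a vmware database and a monitoring user on the InfluxDB server. As a minimal sketch, assuming the InfluxDB 1.x `influx` client and an account with admin rights, the following prints the InfluxQL statements that would create them; the helper name `influx_setup_statements` is purely illustrative. Telegraf can usually create the database on its own when its user has sufficient privileges, so this step may be optional in your environment.

```shell
#!/bin/sh
# Illustrative helper (not part of InfluxDB): print the InfluxQL statements
# for a database and a user. Arguments: database name, user name, password.
influx_setup_statements() {
  printf 'CREATE DATABASE "%s"\n' "$1"
  printf "CREATE USER \"%s\" WITH PASSWORD '%s'\n" "$2" "$3"
  printf 'GRANT ALL ON "%s" TO "%s"\n' "$1" "$2"
}

# Print the statements for the values used in telegraf.conf above.
influx_setup_statements vmware monitoring DBPassword

# To apply them (assumption: the influx 1.x CLI is installed on this host;
# add -username/-password for an admin account if authentication is enabled):
#   influx_setup_statements vmware monitoring DBPassword |
#     while IFS= read -r stmt; do influx -execute "$stmt" </dev/null; done
```

Printing the statements first makes it easy to review them before anything is changed on the server.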
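If authentication is not enabled on InfluxDB, you can safely remove the username and password lines from the configuration.

2. Configure the vsphere input plugin for Telegraf.
The complete configuration should look similar to this: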

# Read metrics from VMware vCenter
 [[inputs.vsphere]]
 ## List of vCenter URLs to be monitored. These three lines must be uncommented
 ## and edited for the plugin to work.
 vcenters = [ "https://10.10.1.2/sdk" ]
 username = "Hyman@theitroad"
 password = "AdminPassword"
 #
 ## VMs
 ## Typical VM metrics (if omitted or empty, all metrics are collected)
 vm_metric_include = [
      "cpu.demand.average",
      "cpu.idle.summation",
      "cpu.latency.average",
      "cpu.readiness.average",
      "cpu.ready.summation",
      "cpu.run.summation",
      "cpu.usagemhz.average",
      "cpu.used.summation",
      "cpu.wait.summation",
      "mem.active.average",
      "mem.granted.average",
      "mem.latency.average",
      "mem.swapin.average",
      "mem.swapinRate.average",
      "mem.swapout.average",
      "mem.swapoutRate.average",
      "mem.usage.average",
      "mem.vmmemctl.average",
      "net.bytesRx.average",
      "net.bytesTx.average",
      "net.droppedRx.summation",
      "net.droppedTx.summation",
      "net.usage.average",
      "power.power.average",
      "virtualDisk.numberReadAveraged.average",
      "virtualDisk.numberWriteAveraged.average",
      "virtualDisk.read.average",
      "virtualDisk.readOIO.latest",
      "virtualDisk.throughput.usage.average",
      "virtualDisk.totalReadLatency.average",
      "virtualDisk.totalWriteLatency.average",
      "virtualDisk.write.average",
      "virtualDisk.writeOIO.latest",
      "sys.uptime.latest",
    ]
 # vm_metric_exclude = [] ## Nothing is excluded by default
 # vm_instances = true ## true by default
 #
 ## Hosts
 ## Typical host metrics (if omitted or empty, all metrics are collected)
 host_metric_include = [
      "cpu.coreUtilization.average",
      "cpu.costop.summation",
      "cpu.demand.average",
      "cpu.idle.summation",
      "cpu.latency.average",
      "cpu.readiness.average",
      "cpu.ready.summation",
      "cpu.swapwait.summation",
      "cpu.usage.average",
      "cpu.usagemhz.average",
      "cpu.used.summation",
      "cpu.utilization.average",
      "cpu.wait.summation",
      "disk.deviceReadLatency.average",
      "disk.deviceWriteLatency.average",
      "disk.kernelReadLatency.average",
      "disk.kernelWriteLatency.average",
      "disk.numberReadAveraged.average",
      "disk.numberWriteAveraged.average",
      "disk.read.average",
      "disk.totalReadLatency.average",
      "disk.totalWriteLatency.average",
      "disk.write.average",
      "mem.active.average",
      "mem.latency.average",
      "mem.state.latest",
      "mem.swapin.average",
      "mem.swapinRate.average",
      "mem.swapout.average",
      "mem.swapoutRate.average",
      "mem.totalCapacity.average",
      "mem.usage.average",
      "mem.vmmemctl.average",
      "net.bytesRx.average",
      "net.bytesTx.average",
      "net.droppedRx.summation",
      "net.droppedTx.summation",
      "net.errorsRx.summation",
      "net.errorsTx.summation",
      "net.usage.average",
      "power.power.average",
      "storageAdapter.numberReadAveraged.average",
      "storageAdapter.numberWriteAveraged.average",
      "storageAdapter.read.average",
      "storageAdapter.write.average",
      "sys.uptime.latest",
    ]
 # host_metric_exclude = [] ## Nothing excluded by default
 # host_instances = true ## true by default
 #
 ## Clusters
 cluster_metric_include = [] ## if omitted or empty, all metrics are collected
 # cluster_metric_exclude = [] ## Nothing excluded by default
 # cluster_instances = false ## false by default
 #
 ## Datastores
 datastore_metric_include = [] ## if omitted or empty, all metrics are collected
 # datastore_metric_exclude = [] ## Nothing excluded by default
 # datastore_instances = false ## false by default for Datastores only
 #
 ## Datacenters
 datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
# datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
 # datacenter_instances = false ## false by default for Datastores only
 #
 ## Plugin Settings
 ## separator character to use for measurement and field names (default: "_")
 # separator = "_"
 #
 ## number of objects to retrieve per query for realtime resources (vms and hosts)
 ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
 # max_query_objects = 256
 #
 ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
 ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
 # max_query_metrics = 256
 #
 ## number of go routines to use for collection and discovery of objects and metrics
 # collect_concurrency = 1
 # discover_concurrency = 1
 #
 ## whether or not to force discovery of new objects on initial gather call before collecting metrics
 ## when true for large environments this may cause errors for time elapsed while collecting metrics
 ## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
 # force_discover_on_init = false
 #
 ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
 # object_discovery_interval = "300s"
 #
 ## timeout applies to any of the api request made to vcenter
 # timeout = "60s"
 #
 ## Optional SSL Config
 # ssl_ca = "/path/to/cafile"
 # ssl_cert = "/path/to/certfile"
 # ssl_key = "/path/to/keyfile"
 ## Use SSL but skip chain & host verification
 insecure_skip_verify = true

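Only a few values need to change: replace 10.10.1.2 with the IP address of your vCenter server, and set username and password to the credentials of a vCenter user account that Telegraf will authenticate with.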

If the vCenter Server uses a self-signed certificate, make sure the insecure_skip_verify flag is set to true.

insecure_skip_verify = true

After making the changes, restart the Telegraf service and enable it to start at boot.

sudo systemctl restart telegraf
sudo systemctl enable telegraf
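Before relying on the service, it can help to run the vsphere input once in the foreground. The sketch below assumes the configuration path used above and a Telegraf build that supports the `--test` and `--input-filter` options; the function name `check_vsphere_input` is illustrative. Because the plugin discovers objects in the background, the first test run may print few or no metrics.

```shell
#!/bin/sh
# Illustrative helper: run only the vsphere input once and print the gathered
# metrics to stdout instead of sending them to InfluxDB.
# Argument: path to the Telegraf config (default: /etc/telegraf/telegraf.conf).
check_vsphere_input() {
  conf="${1:-/etc/telegraf/telegraf.conf}"
  if ! command -v telegraf >/dev/null 2>&1; then
    echo "telegraf not found in PATH" >&2
    return 127
  fi
  if [ ! -r "$conf" ]; then
    echo "cannot read $conf" >&2
    return 1
  fi
  telegraf --config "$conf" --input-filter vsphere --test
}

# Usage:
#   check_vsphere_input                  # uses the default config path
#   sudo systemctl status telegraf       # confirm the service is active
#   sudo journalctl -u telegraf -n 50    # recent log lines from the unit
```

Authentication or certificate problems with vCenter would normally show up in the test output or in the journal.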

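Step 3: Check the InfluxDB metrics

We need to confirm that the metrics are being pushed to InfluxDB and that we can see them.

Open the InfluxDB shell:

With authentication:

$ influx -username 'username' -password 'StrongPassword'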
Connected to http://localhost:8086 version 1.6.4
InfluxDB shell version: 1.6.4

'username' is the InfluxDB authentication username and 'StrongPassword' is the InfluxDB password.

Without authentication:

$ influx
Connected to http://localhost:8086 version 1.6.4
InfluxDB shell version: 1.6.4

Switch to the vmware database that we configured in Telegraf.

> USE vmware
Using database vmware

Check whether time series metrics are flowing in.

> SHOW MEASUREMENTS
name: measurements
name
---
cpu
disk
diskio
kernel
mem
processes
swap
system
vsphere_cluster_clusterServices
vsphere_cluster_mem
vsphere_cluster_vmop
vsphere_datacenter_vmop
vsphere_datastore_datastore
vsphere_datastore_disk
vsphere_host_cpu
vsphere_host_disk
vsphere_host_mem
vsphere_host_net
vsphere_host_power
vsphere_host_storageAdapter
vsphere_host_sys
vsphere_vm_cpu
vsphere_vm_mem
vsphere_vm_net
vsphere_vm_power
vsphere_vm_sys
vsphere_vm_virtualDisk
>
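The same check can be scripted. This sketch assumes the InfluxDB 1.x `influx` client shown above, using its `-database` and `-execute` options; `count_vsphere_measurements` is a made-up helper name that counts the lines beginning with `vsphere_`.

```shell
#!/bin/sh
# Illustrative helper: count measurement names on stdin that start with
# "vsphere_". grep exits non-zero when nothing matches, hence "|| true".
count_vsphere_measurements() {
  grep -c '^vsphere_' || true
}

# Example with a few names copied from the listing above; prints 2.
printf '%s\n' cpu mem vsphere_host_cpu vsphere_vm_mem | count_vsphere_measurements

# Against a live server (add -username/-password if authentication is enabled):
#   influx -database vmware -execute 'SHOW MEASUREMENTS' | count_vsphere_measurements
```

A count of 0 after a few collection intervals suggests that the vsphere input is not delivering data yet.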

Step 4: Add the InfluxDB data source to Grafana

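Log in to Grafana and add an InfluxDB data source.

Give it a name, select InfluxDB as the type, and specify the server address.

If applicable, provide the database name and authentication credentials.

Save and test the settings.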
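The data source can also be created without the UI. The sketch below builds a JSON payload for Grafana's HTTP API endpoint `POST /api/datasources`. The data source name, the `localhost:3000` address, and the `admin:admin` credentials are assumptions, and the exact field layout (for example, top-level `database` and `user`) differs between Grafana versions, so check it against the version you run.

```shell
#!/bin/sh
# Illustrative helper: print an InfluxDB data source definition as JSON.
# Arguments: InfluxDB URL, database, user, password.
# Values are inserted as-is, so they must not contain quotes or backslashes.
grafana_influx_datasource_json() {
  cat <<EOF
{
  "name": "InfluxDB-vmware",
  "type": "influxdb",
  "access": "proxy",
  "url": "$1",
  "database": "$2",
  "user": "$3",
  "secureJsonData": { "password": "$4" }
}
EOF
}

# Print the payload for the values used earlier in this guide.
grafana_influx_datasource_json http://10.10.1.20:8086 vmware monitoring DBPassword

# To send it (assumed Grafana address and default admin credentials):
#   grafana_influx_datasource_json http://10.10.1.20:8086 vmware monitoring DBPassword |
#     curl -s -u admin:admin -H 'Content-Type: application/json' \
#          -X POST --data-binary @- http://localhost:3000/api/datasources
```

If authentication is not enabled on InfluxDB, the user and password entries can be left empty.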

Step 5: Import Grafana dashboards

All dependencies are now configured and tested.
The last task is to create or import Grafana dashboards that display the vSphere metrics.
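When building a panel by hand, each graph is backed by an InfluxQL query against one of the vsphere_ measurements. As a rough sketch, the helper below prints a query for average host CPU usage. It assumes the default `_` separator, so that the metric cpu.usage.average is stored as the field `usage_average` in `vsphere_host_cpu`, and it assumes the `esxhostname` tag on host metrics. Verify both with SHOW FIELD KEYS and SHOW TAG KEYS on your own data.

```shell
#!/bin/sh
# Illustrative helper: print an InfluxQL query for mean host CPU usage,
# grouped per ESXi host in 5-minute buckets.
# Argument: look-back window such as 1h or 30m (default: 1h).
host_cpu_query() {
  window="${1:-1h}"
  printf 'SELECT mean("usage_average") FROM "vsphere_host_cpu" WHERE time > now() - %s GROUP BY time(5m), "esxhostname"\n' "$window"
}

# Print the query for the last six hours.
host_cpu_query 6h

# Run it from the command line (assumes the influx 1.x CLI):
#   influx -database vmware -execute "$(host_cpu_query 6h)"
# In a Grafana panel, the time condition would typically be replaced with
# $timeFilter and the fixed 5m interval with $__interval.
```

Trying the query in the InfluxDB shell first is a quick way to confirm that a panel will have data to show.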