Learn about the Wavefront NVIDIA Integration.

NVIDIA Integration

  1. NVIDIA: This integration allows you to monitor your NVIDIA GPU cluster. It uses the Telegraf nvidia_smi input plugin to pull the GPU name, fan speed, memory usage, temperature, and UUID.

  2. NVIDIA on Kubernetes: This explains how to configure the Wavefront Collector for Kubernetes to scrape NVIDIA metrics by using the NVIDIA Data Center GPU Manager (DCGM) tool.

In addition to setting up the metrics flow, this integration also installs dashboards:

  • NVIDIA
  • NVIDIA on Kubernetes

[NVIDIA dashboard preview: images/nvidia-dashboard-preview.png]

To see a list of the metrics for this integration, see the nvidia_smi input plugin in the Telegraf plugins repository: https://github.com/influxdata/telegraf/tree/master/plugins/inputs.

NVIDIA Setup

This integration uses Telegraf to fetch the metrics from the NVIDIA System Management Interface (nvidia-smi) and send them to the Wavefront service.
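
Before configuring Telegraf, you can confirm that nvidia-smi is installed and returning data on the GPU host. A minimal check, assuming the default Linux binary location and querying a few of the fields this integration collects:

/usr/bin/nvidia-smi --query-gpu=name,fan.speed,memory.used,temperature.gpu,uuid --format=csv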

Use the instructions on this page for monitoring:

  • NVIDIA - Standalone
  • NVIDIA on Kubernetes

Step 1: Install the Telegraf Agent

This integration uses the nvidia_smi input plugin for Telegraf. If you’ve already installed Telegraf on one of your servers, you can skip to Step 2.

Log in to your Wavefront instance and follow the instructions in the Setup tab to install Telegraf and a Wavefront proxy in your environment. If a proxy is already running in your environment, you can select that proxy and the Telegraf install command connects with that proxy. Sign up for a free trial to check it out!

Step 2: Configure NVIDIA Input Plugin

Create a file called nvidia.conf in /etc/telegraf/telegraf.d and enter the following snippet:

[[inputs.nvidia_smi]]
  ## Typical locations for nvidia-smi are /usr/bin/nvidia-smi for Linux
  ## and C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe
  ## for Windows.
  bin_path = "/usr/bin/nvidia-smi"

  timeout = "1s"

Note: On Windows, the telegraf.conf file is located at C:\Program Files\Telegraf\telegraf.conf.
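
On Windows, a minimal sketch of the same plugin block, assuming nvidia-smi.exe is in the default NVSMI location shown above:

[[inputs.nvidia_smi]]
  ## Default nvidia-smi.exe location on Windows; adjust if your driver
  ## installed it elsewhere. Backslashes must be escaped in TOML strings.
  bin_path = "C:\\Program Files\\NVIDIA Corporation\\NVSMI\\nvidia-smi.exe"

  timeout = "1s"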

Save the file and restart Telegraf as described in Step 3.

Step 3: Restart Telegraf

Linux:

sudo service telegraf restart

Windows:

net stop telegraf
net start telegraf
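
To confirm the plugin returns data without waiting for the next collection interval, you can run Telegraf once in test mode. This is only a quick sanity check; the paths below assume a default Linux install:

telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --input-filter nvidia_smi --test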

NVIDIA on Kubernetes

This section explains how to configure the Wavefront Collector for Kubernetes to scrape NVIDIA metrics by using the NVIDIA Data Center GPU Manager (DCGM) tool.

Configure the Wavefront Collector for Kubernetes

You can configure the Wavefront Collector for Kubernetes to scrape the Prometheus-format metrics exposed by the DCGM exporter by using the configuration below.

If you do not already have the Wavefront Collector for Kubernetes installed, follow these instructions to add it to your cluster, either by using Helm or by performing a manual installation.

      ## Add this configuration under the discovery plugins section
      - name: nvidia
        type: prometheus
        selectors:
          images:
            - '*dcgm-exporter*'
        port: 9400
        path: /metrics
        scheme: http
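
The discovery rule above assumes the DCGM exporter is already running in your cluster and exposing Prometheus metrics on port 9400. A quick way to check, using placeholder pod and namespace names:

# List pods running the DCGM exporter (the namespace depends on how it was deployed)
kubectl get pods --all-namespaces | grep dcgm-exporter

# Optionally spot-check the endpoint the Collector will scrape
kubectl port-forward -n <namespace> <dcgm-exporter-pod> 9400:9400
curl http://localhost:9400/metrics | head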

NVIDIA Metrics

Metric Name Description
nvidia.smi.temperature.gpu The current temperature readings for the device, in degrees C.
nvidia.smi.utilization.gpu The GPU utilization in percentages.
nvidia.smi.utilization.memory The memory utilization in percentages.
nvidia.smi.fan.speed The fan speed in percentages.
nvidia.smi.power.draw The total power drawn in watts.
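
Once data is flowing, these metrics can be charted with Wavefront Query Language ts() queries. For example, to plot GPU temperature for one host (the source name here is only a placeholder):

ts("nvidia.smi.temperature.gpu", source="your-gpu-host")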

NVIDIA on Kubernetes Metrics

Metric Name Description
DCGM.FI.DEV.DEC.UTIL.gauge The decoder utilization in percentages.
DCGM.FI.DEV.ENC.UTIL.gauge The encoder utilization in percentages.
DCGM.FI.DEV.FB.FREE.gauge The free frame buffer in MB.
DCGM.FI.DEV.FB.USED.gauge The used frame buffer in MB.
DCGM.FI.DEV.GPU.TEMP.gauge The current temperature readings for the device, in degrees C.
DCGM.FI.DEV.MEM.CLOCK.gauge The memory clock for the device.
DCGM.FI.DEV.MEM.COPY.UTIL.gauge The memory utilization in percentages.
DCGM.FI.DEV.NVLINK.BANDWIDTH.TOTAL.counter The total bandwidth of NVLink.
DCGM.FI.DEV.PCIE.REPLAY.COUNTER.counter The PCIe replay counter.
DCGM.FI.DEV.POWER.USAGE.gauge The power usage for the device in watts.
DCGM.FI.DEV.SM.CLOCK.gauge The SM clock for the device.
DCGM.FI.DEV.TOTAL.ENERGY.CONSUMPTION.counter The total energy consumption for the GPU in mJ since the driver was last reloaded.
DCGM.FI.DEV.VGPU.LICENSE.STATUS.gauge The license status of the vGPU instance.
DCGM.FI.DEV.XID.ERRORS.gauge XID errors. The value is the specific XID error.
DCGM.FI.PROF.DRAM.ACTIVE.gauge The ratio of cycles the device memory interface is active sending or receiving data.
DCGM.FI.PROF.GR.ENGINE.ACTIVE.gauge The ratio of time the graphics engine is active.
DCGM.FI.PROF.PCIE.RX.BYTES.counter The number of bytes of active PCIe Rx data including both header and payload.
DCGM.FI.PROF.PCIE.TX.BYTES.counter The number of bytes of active PCIe Tx data including both header and payload.
DCGM.FI.PROF.PIPE.TENSOR.ACTIVE.gauge The ratio of cycles the tensor (HMMA) pipe is active.
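
The DCGM metrics can be queried the same way. A simple example, assuming the metric names arrive as listed above:

ts("DCGM.FI.DEV.GPU.TEMP.gauge")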