Metrics Reference - Andromeda Product Docs

All metrics listed below are available through your Grafana dashboards. Queries are automatically scoped to your assigned nodes and namespaces.

GPU

DCGM utilization, memory, thermal, ECC, NVLink, PCIe, and XID metrics.

Node

Aggregated CPU, memory, load, pressure, thermals, and EDAC metrics.

Container

CPU and memory metrics scoped to pods in your namespace.

Slurm

Job state, GPU allocation, CPU allocation, and assigned-capacity rules.

InfiniBand

Port counters, link rate, transmit wait, and fabric-level congestion metrics.

Kubernetes state

Node conditions, cordon state, pod metadata, and retained pod phase metrics.

In the tables below, use the Metric column to match dashboard panels or queries, the Unit column to set the panel scale, and the Description column to decide whether the signal applies to your workload.

GPU (DCGM)

Labels: cluster, namespace, pod, node, gpu (index), modelName, Hostname.

Metric	Type	Unit	Description
`DCGM_FI_DEV_GPU_UTIL`	gauge	%	GPU SM utilization (0-100)
`DCGM_FI_DEV_MEM_COPY_UTIL`	gauge	%	Memory controller utilization
`DCGM_FI_DEV_GPU_TEMP`	gauge	°C	GPU die temperature
`DCGM_FI_DEV_MEMORY_TEMP`	gauge	°C	HBM memory temperature
`DCGM_FI_DEV_POWER_USAGE`	gauge	W	Current power draw
`DCGM_FI_DEV_FB_FREE`	gauge	MiB	Free framebuffer memory
`DCGM_FI_DEV_FB_USED`	gauge	MiB	Used framebuffer memory
`DCGM_FI_DEV_XID_ERRORS`	gauge		Last XID error code (0 = none)
`DCGM_FI_DEV_ECC_SBE_AGG_TOTAL`	counter		Correctable ECC errors (aggregate lifetime)
`DCGM_FI_DEV_ECC_DBE_AGG_TOTAL`	counter		Uncorrectable ECC errors (aggregate lifetime)
`DCGM_FI_DEV_ECC_SBE_VOL_TOTAL`	counter		Correctable ECC errors (since last reset)
`DCGM_FI_DEV_ECC_DBE_VOL_TOTAL`	counter		Uncorrectable ECC errors (since last reset)
`DCGM_FI_DEV_GPU_NVLINK_ERRORS`	counter		NVLink error count
`DCGM_FI_DEV_PCIE_REPLAY_COUNTER`	counter		PCIe replay count
`DCGM_FI_DEV_ROW_REMAP_FAILURE`	gauge		Row remap failure (0 = ok, >0 = failed)
`DCGM_FI_DEV_RETIRED_SBE`	counter		Pages retired due to correctable errors
`DCGM_FI_DEV_RETIRED_DBE`	counter		Pages retired due to uncorrectable errors
`DCGM_FI_DEV_THERMAL_VIOLATION`	counter	us	Microseconds spent thermal throttling

SM profiling metrics (DCGM_FI_PROF_SM_ACTIVE, DCGM_FI_PROF_SM_OCCUPANCY) and volatile ECC counters appear when they are enabled for your cluster. If a panel is empty after you confirm filters, ask Andromeda Support whether those signals are available for your environment.
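
The GPU gauges and counters above are often combined into derived values on a panel. The following is a minimal sketch, not an Andromeda-provided helper: the function names are hypothetical, total framebuffer is approximated as used plus free (which ignores any reserved memory), and the inputs are assumed to be raw sample values you have already queried.

```python
def fb_used_percent(fb_used_mib: float, fb_free_mib: float) -> float:
    """Approximate framebuffer usage in percent.

    Inputs are DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE samples (MiB).
    Assumption: total is taken as used + free, ignoring reserved memory.
    """
    total = fb_used_mib + fb_free_mib
    if total <= 0:
        return 0.0
    return 100.0 * fb_used_mib / total


def throttle_fraction(violation_us_start: float,
                      violation_us_end: float,
                      window_seconds: float) -> float:
    """Fraction of a time window spent thermal throttling.

    Inputs are two DCGM_FI_DEV_THERMAL_VIOLATION samples (microseconds)
    taken window_seconds apart.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    delta_us = violation_us_end - violation_us_start
    if delta_us < 0:
        # Counter went backwards: treat it as a reset and count from zero.
        delta_us = violation_us_end
    return min(1.0, delta_us / 1e6 / window_seconds)
```

For example, a GPU reporting 20480 MiB used and 61440 MiB free is at roughly 25% framebuffer usage under this approximation.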

Node (aggregated)

CPU metrics are summarized per host (and often per NUMA group) so charts stay readable, rather than listing every logical CPU separately.

Metric	Type	Unit	Description
`tenant_node_cpu_seconds_total:15s_without_cpu_total`	counter	s	CPU time by mode (idle, user, system, iowait, etc.), aggregated into 8-core bins
`tenant_node_cpu_scaling_frequency_hertz:15s_without_cpu_avg`	gauge	Hz	Average CPU frequency per NUMA node
`tenant_node_memory_MemTotal_bytes`	gauge	B	Total RAM
`tenant_node_memory_MemAvailable_bytes`	gauge	B	Available RAM
`tenant_node_memory_MemFree_bytes`	gauge	B	Free RAM
`tenant_node_load1`	gauge		1-minute load average
`tenant_node_load5`	gauge		5-minute load average
`tenant_node_load15`	gauge		15-minute load average
`tenant_node_vmstat_oom_kill`	counter		OOM kill count
`tenant_node_pressure_cpu_waiting_seconds_total`	counter	s	PSI: CPU pressure (waiting)
`tenant_node_pressure_memory_waiting_seconds_total`	counter	s	PSI: Memory pressure (waiting)
`tenant_node_pressure_memory_stalled_seconds_total`	counter	s	PSI: Memory pressure (stalled)
`tenant_node_pressure_io_stalled_seconds_total`	counter	s	PSI: I/O pressure (stalled)
`tenant_node_hwmon_temp_celsius`	gauge	°C	Hardware sensor temperatures
`tenant_node_edac_correctable_errors_total`	counter		EDAC correctable memory errors
`tenant_node_edac_uncorrectable_errors_total`	counter		EDAC uncorrectable errors
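
Most node metrics above are cumulative counters, so a chart usually shows the change between two samples rather than the raw value. The sketch below illustrates that arithmetic under stated assumptions: the function names are hypothetical, the per-mode dictionaries stand in for `tenant_node_cpu_seconds_total:15s_without_cpu_total` samples keyed by mode, and treating `idle` and `iowait` as non-busy is one common convention, not a rule from this reference.

```python
# Assumption: these two modes count as "not busy".
IDLE_MODES = {"idle", "iowait"}


def cpu_busy_fraction(start: dict, end: dict) -> float:
    """Busy share of CPU time between two samples of per-mode CPU seconds."""
    deltas = {mode: end[mode] - start.get(mode, 0.0) for mode in end}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0
    busy = sum(d for mode, d in deltas.items() if mode not in IDLE_MODES)
    return busy / total


def memory_used_percent(mem_total_bytes: float, mem_available_bytes: float) -> float:
    """Used memory in percent, from MemTotal and MemAvailable samples."""
    if mem_total_bytes <= 0:
        raise ValueError("mem_total_bytes must be positive")
    return 100.0 * (mem_total_bytes - mem_available_bytes) / mem_total_bytes


def pressure_fraction(stall_start_s: float, stall_end_s: float,
                      window_seconds: float) -> float:
    """Share of a window spent waiting or stalled, from two PSI counter samples."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (stall_end_s - stall_start_s) / window_seconds
```

A pressure fraction near zero means tasks rarely waited on that resource during the window; values approaching 1 indicate sustained contention.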

Container

Scoped to pods in your namespace.

Metric	Type	Unit	Description
`container_cpu_usage_seconds_total`	counter	s	Cumulative CPU time consumed
`container_memory_working_set_bytes`	gauge	B	Current working set (determines OOM kills)
`container_memory_rss`	gauge	B	Resident set size
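
Because `container_cpu_usage_seconds_total` is cumulative, the number of cores a container is using has to be derived from the increase over time. The following sketch shows one way to do that; the function names and the `(timestamp, value)` sample shape are assumptions for illustration. It handles counter resets in a similar spirit to PromQL's `rate()`, but without extrapolation.

```python
def increase_with_resets(values: list) -> float:
    """Total increase of a counter, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def cpu_cores_used(samples: list) -> float:
    """Average cores used over the sampled window.

    samples: list of (timestamp_seconds, cumulative_cpu_seconds),
    sorted by timestamp.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase_with_resets([value for _, value in samples]) / elapsed
```

A result of 1.5 means the container consumed, on average, one and a half CPU cores' worth of time over the window.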

Slurm

Metric	Type	Labels	Description
`slurm_job_state`	gauge	`job_id`, `job_name`, `user`, `partition`, `state`	Job state (Running=1, Pending=2, etc.). No `node` label.
`slurm_job_gpus_allocated`	gauge	`job_id`, `node`	GPUs allocated to job on each node
`slurm_job_cpus_allocated`	gauge	`job_id`, `node`	CPUs allocated to job on each node

Pre-computed recording rules

Metric	Description
`tenant:slurm_nodes:assigned`	Total nodes assigned to your environment
`tenant:slurm_nodes:ready`	Healthy, schedulable nodes
`tenant:slurm_nodes:bad`	Drained, not-ready, or degraded nodes
`slurm_job:allocated_node_info`	Join of job allocation with node metadata

InfiniBand

Available only on clusters where InfiniBand is deployed. These metrics are not available on RoCE clusters.

Metric	Type	Unit	Description
`tenant_node_infiniband_port_data_transmitted_bytes_total`	counter	B	TX bytes per IB port
`tenant_node_infiniband_port_data_received_bytes_total`	counter	B	RX bytes per IB port
`tenant_node_infiniband_port_transmit_wait_total`	counter	ticks	Backpressure ticks (4 ns each)
`tenant_node_infiniband_rate_bytes_per_second`	gauge	B/s	Link rate (50 GB/s corresponds to NDR, 400 Gb/s)
`tenant_ib_perfquery_xmit_data_bytes_total`	counter	B	Fabric-level TX bytes (64-bit counters)
`tenant_ib_perfquery_rcv_data_bytes_total`	counter	B	Fabric-level RX bytes (64-bit counters)
`tenant_ib_perfquery_xmit_wait_total`	counter	ticks	Fabric-level backpressure
`tenant_ib_perfquery_xmit_discards_total`	counter		Fabric-level congestion discards
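
The byte counters and link rate above can be combined to estimate how busy a port is, and the wait counter can be converted to time using the 4 ns tick noted in the table. This is a minimal sketch with hypothetical function names; it assumes you already have two counter samples taken `window_seconds` apart and that no counter reset occurred between them.

```python
TICK_SECONDS = 4e-9  # per the table above: each transmit-wait tick is 4 ns


def link_utilization(bytes_start: float, bytes_end: float,
                     window_seconds: float,
                     rate_bytes_per_second: float) -> float:
    """Fraction of link capacity used in one direction over the window.

    bytes_start/bytes_end: two samples of a TX or RX byte counter.
    rate_bytes_per_second: tenant_node_infiniband_rate_bytes_per_second.
    """
    if window_seconds <= 0 or rate_bytes_per_second <= 0:
        raise ValueError("window and link rate must be positive")
    throughput = (bytes_end - bytes_start) / window_seconds
    return throughput / rate_bytes_per_second


def transmit_wait_fraction(ticks_start: float, ticks_end: float,
                           window_seconds: float) -> float:
    """Fraction of the window a port had data to send but had to wait."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (ticks_end - ticks_start) * TICK_SECONDS / window_seconds
```

For example, 750 GB transmitted in 30 seconds on a 50 GB/s link is about 50% utilization in the transmit direction.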

Weka (where deployed)

Metric	Type	Description
`tenant_weka_*`	various	Weka filesystem metrics (node-local subset). Available on clusters with Weka storage.

Kubernetes state

Metric	Type	Description
`tenant_kube_node_status_condition`	gauge	Node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable)
`tenant_kube_node_spec_unschedulable`	gauge	1 if node is cordoned
`kube_pod_info`	gauge	Pod metadata (node, namespace, pod name). Useful for joins.
`kube_pod_status_phase`	gauge	Pod phase for workloads represented in monitoring (typically running jobs)

Beyond the default dashboard set

Standard dashboards emphasize GPU health, host pressure, Slurm placement, containers, InfiniBand, and Kubernetes state. If you need per-core CPU metrics, Kubernetes control-plane detail, or container network and disk statistics, contact Andromeda Support to discuss expanded coverage.
