All metrics listed below are available through your Grafana dashboards. Queries are automatically scoped to your assigned nodes and namespaces.

GPU

DCGM utilization, memory, thermal, ECC, NVLink, PCIe, and XID metrics.

Node

Aggregated CPU, memory, load, pressure, thermals, and EDAC metrics.

Container

CPU and memory metrics scoped to pods in your namespace.

Slurm

Job state, GPU allocation, CPU allocation, and assigned-capacity rules.

InfiniBand

Port counters, link rate, transmit wait, and fabric-level congestion metrics.

Kubernetes state

Node conditions, cordon state, pod metadata, and retained pod phase metrics.

Use Metric for dashboard or query matching, Unit for panel scale, and Description to decide whether the signal applies to your workload.

GPU (DCGM)

Labels: cluster, namespace, pod, node, gpu (index), modelName, Hostname.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_GPU_UTIL | gauge | % | GPU SM utilization (0-100) |
| DCGM_FI_DEV_MEM_COPY_UTIL | gauge | % | Memory controller utilization |
| DCGM_FI_DEV_GPU_TEMP | gauge | °C | GPU die temperature |
| DCGM_FI_DEV_MEMORY_TEMP | gauge | °C | HBM memory temperature |
| DCGM_FI_DEV_POWER_USAGE | gauge | W | Current power draw |
| DCGM_FI_DEV_FB_FREE | gauge | MiB | Free framebuffer memory |
| DCGM_FI_DEV_FB_USED | gauge | MiB | Used framebuffer memory |
| DCGM_FI_DEV_XID_ERRORS | gauge | | Last XID error code (0 = none) |
| DCGM_FI_DEV_ECC_SBE_AGG_TOTAL | counter | | Correctable ECC errors (aggregate lifetime) |
| DCGM_FI_DEV_ECC_DBE_AGG_TOTAL | counter | | Uncorrectable ECC errors (aggregate lifetime) |
| DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | counter | | Correctable ECC errors (since last reset) |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | counter | | Uncorrectable ECC errors (since last reset) |
| DCGM_FI_DEV_GPU_NVLINK_ERRORS | counter | | NVLink error count |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | counter | | PCIe replay count |
| DCGM_FI_DEV_ROW_REMAP_FAILURE | gauge | | Row remap failure (0 = ok, >0 = failed) |
| DCGM_FI_DEV_RETIRED_SBE | counter | | Pages retired due to correctable errors |
| DCGM_FI_DEV_RETIRED_DBE | counter | | Pages retired due to uncorrectable errors |
| DCGM_FI_DEV_THERMAL_VIOLATION | counter | µs | Microseconds spent thermal throttling |

SM profiling metrics (DCGM_FI_PROF_SM_ACTIVE, DCGM_FI_PROF_SM_OCCUPANCY) and volatile ECC counters appear when they are enabled for your cluster. If a panel is empty after you confirm filters, ask Andromeda Support whether those signals are available for your environment.
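
As a rough illustration of how two of the GPU signals above might be turned into panel-friendly numbers, the Python sketch below derives a framebuffer usage percentage from DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE, and a throttled-time fraction from two readings of DCGM_FI_DEV_THERMAL_VIOLATION. The function names and inputs are hypothetical, and the sketch assumes that used plus free approximates the usable framebuffer, which may be somewhat less than the card's physical capacity.

```python
def fb_used_percent(fb_used_mib: float, fb_free_mib: float) -> float:
    """Percent of (used + free) framebuffer that is in use.

    Assumption: used + free approximates usable framebuffer memory.
    """
    total_mib = fb_used_mib + fb_free_mib
    if total_mib <= 0:
        raise ValueError("used + free framebuffer must be positive")
    return 100.0 * fb_used_mib / total_mib


def throttled_fraction(us_start: float, us_end: float, window_seconds: float) -> float:
    """Fraction of a time window spent thermal throttling.

    us_start and us_end are two readings of the microsecond counter taken
    window_seconds apart. A decrease is treated as a counter reset
    (an assumption of this sketch, not documented behavior).
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    delta_us = us_end - us_start
    if delta_us < 0:
        delta_us = us_end  # counter restarted from zero
    return min(1.0, delta_us / (window_seconds * 1_000_000))
```

A result of 0.05 from `throttled_fraction` over a 60-second window would mean the GPU spent about 3 seconds of that minute throttled.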

Node (aggregated)

CPU metrics are summarized per host (and often per NUMA group) so charts stay readable, rather than listing every logical CPU separately.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| tenant_node_cpu_seconds_total:15s_without_cpu_total | counter | s | CPU time by mode (idle, user, system, iowait, etc.), aggregated into 8-core bins |
| tenant_node_cpu_scaling_frequency_hertz:15s_without_cpu_avg | gauge | Hz | Average CPU frequency per NUMA node |
| tenant_node_memory_MemTotal_bytes | gauge | B | Total RAM |
| tenant_node_memory_MemAvailable_bytes | gauge | B | Available RAM |
| tenant_node_memory_MemFree_bytes | gauge | B | Free RAM |
| tenant_node_load1 | gauge | | 1-minute load average |
| tenant_node_load5 | gauge | | 5-minute load average |
| tenant_node_load15 | gauge | | 15-minute load average |
| tenant_node_vmstat_oom_kill | counter | | OOM kill count |
| tenant_node_pressure_cpu_waiting_seconds_total | counter | s | PSI: CPU pressure (waiting) |
| tenant_node_pressure_memory_waiting_seconds_total | counter | s | PSI: Memory pressure (waiting) |
| tenant_node_pressure_memory_stalled_seconds_total | counter | s | PSI: Memory pressure (stalled) |
| tenant_node_pressure_io_stalled_seconds_total | counter | s | PSI: I/O pressure (stalled) |
| tenant_node_hwmon_temp_celsius | gauge | °C | Hardware sensor temperatures |
| tenant_node_edac_correctable_errors_total | counter | | EDAC correctable memory errors |
| tenant_node_edac_uncorrectable_errors_total | counter | | EDAC uncorrectable errors |
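
The memory gauges and PSI counters above are usually easier to read as percentages. The following Python sketch shows one way that arithmetic could look: memory in use from tenant_node_memory_MemTotal_bytes and tenant_node_memory_MemAvailable_bytes, and the share of a window spent under pressure from two readings of any of the tenant_node_pressure_*_seconds_total counters. The function names are hypothetical and counter resets are not handled.

```python
def memory_used_percent(mem_total_bytes: float, mem_available_bytes: float) -> float:
    """Percent of RAM not reported as available."""
    if mem_total_bytes <= 0:
        raise ValueError("mem_total_bytes must be positive")
    return 100.0 * (mem_total_bytes - mem_available_bytes) / mem_total_bytes


def pressure_percent(seconds_start: float, seconds_end: float, window_seconds: float) -> float:
    """Percent of the window that tasks spent waiting or stalled.

    seconds_start and seconds_end are two readings of one PSI counter
    taken window_seconds apart (no reset handling in this sketch).
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return 100.0 * (seconds_end - seconds_start) / window_seconds
```

MemAvailable is used here rather than MemFree because free memory excludes page cache that the kernel can reclaim, so a MemFree-based figure tends to overstate memory use.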

Container

Scoped to pods in your namespace.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| container_cpu_usage_seconds_total | counter | s | Cumulative CPU time consumed |
| container_memory_working_set_bytes | gauge | B | Current working set (determines OOM kills) |
| container_memory_rss | gauge | B | Resident set size |
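
To make the container metrics concrete, here is a small Python sketch, under stated assumptions, that converts two readings of container_cpu_usage_seconds_total into average cores used and compares container_memory_working_set_bytes with a memory limit. The function names are hypothetical, and the limit is a value you would supply yourself (for example from your pod spec); it is not one of the metrics listed above.

```python
def cpu_cores_used(cpu_seconds_start: float, cpu_seconds_end: float, window_seconds: float) -> float:
    """Average number of CPU cores consumed over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (cpu_seconds_end - cpu_seconds_start) / window_seconds


def working_set_percent_of_limit(working_set_bytes: float, limit_bytes: float) -> float:
    """Working set as a percent of a caller-supplied memory limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return 100.0 * working_set_bytes / limit_bytes
```

Because the working set is the figure tied to OOM kills, a percentage approaching 100 is a more useful warning than one based on resident set size.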

Slurm

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| slurm_job_state | gauge | job_id, job_name, user, partition, state | Job state (Running=1, Pending=2, etc.). No node label. |
| slurm_job_gpus_allocated | gauge | job_id, node | GPUs allocated to job on each node |
| slurm_job_cpus_allocated | gauge | job_id, node | CPUs allocated to job on each node |
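
Because slurm_job_gpus_allocated is reported per job and per node, a job-level total has to be summed across nodes. The Python sketch below illustrates that aggregation on already-fetched samples; the function name, the (labels, value) tuple shape, and the job and node names are all made up for illustration.

```python
from collections import defaultdict


def gpus_per_job(samples):
    """Sum per-node GPU allocations into one total per job_id.

    samples: iterable of (labels, value) pairs, where labels is a dict
    containing at least the job_id label listed in the table above.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels["job_id"]] += value
    return dict(totals)


# Hypothetical samples: job 101 spans two nodes, job 102 uses one.
example_samples = [
    ({"job_id": "101", "node": "node-a"}, 8.0),
    ({"job_id": "101", "node": "node-b"}, 8.0),
    ({"job_id": "102", "node": "node-a"}, 4.0),
]
example_totals = gpus_per_job(example_samples)
```

The same function could be applied to slurm_job_cpus_allocated samples, since that metric carries the same labels.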

Pre-computed recording rules

| Metric | Description |
| --- | --- |
| tenant:slurm_nodes:assigned | Total nodes assigned to your environment |
| tenant:slurm_nodes:ready | Healthy, schedulable nodes |
| tenant:slurm_nodes:bad | Drained, not-ready, or degraded nodes |
| slurm_job:allocated_node_info | Join of job allocation with node metadata |

InfiniBand

Available only on clusters where InfiniBand is deployed; these metrics are not present on RoCE clusters.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| tenant_node_infiniband_port_data_transmitted_bytes_total | counter | B | TX bytes per IB port |
| tenant_node_infiniband_port_data_received_bytes_total | counter | B | RX bytes per IB port |
| tenant_node_infiniband_port_transmit_wait_total | counter | ticks | Backpressure ticks (4 ns each) |
| tenant_node_infiniband_rate_bytes_per_second | gauge | B/s | Link rate (50 GB/s = NDR/400G) |
| tenant_ib_perfquery_xmit_data_bytes_total | counter | B | Fabric-level TX bytes (64-bit counters) |
| tenant_ib_perfquery_rcv_data_bytes_total | counter | B | Fabric-level RX bytes (64-bit counters) |
| tenant_ib_perfquery_xmit_wait_total | counter | ticks | Fabric-level backpressure |
| tenant_ib_perfquery_xmit_discards_total | counter | | Fabric-level congestion discards |
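
The byte counters and the link-rate gauge can be combined into a utilization figure, and the transmit-wait ticks can be converted to time using the 4 ns tick length stated in the table. The Python sketch below shows that arithmetic on two counter readings; the function names and the constant name are hypothetical, and counter resets are not handled.

```python
TICK_SECONDS = 4e-9  # 4 ns per transmit-wait tick, as stated in the table


def link_utilization_percent(bytes_start: float, bytes_end: float,
                             window_seconds: float,
                             link_rate_bytes_per_second: float) -> float:
    """Average port throughput over the window as a percent of link rate."""
    if window_seconds <= 0 or link_rate_bytes_per_second <= 0:
        raise ValueError("window and link rate must be positive")
    throughput = (bytes_end - bytes_start) / window_seconds
    return 100.0 * throughput / link_rate_bytes_per_second


def transmit_wait_seconds(ticks_start: float, ticks_end: float) -> float:
    """Time spent waiting to transmit between two counter readings."""
    return (ticks_end - ticks_start) * TICK_SECONDS
```

For example, 1.5 TB transmitted in 60 seconds on a 50 GB/s (NDR/400G) link averages 25 GB/s, which is half of the link rate.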

Weka (where deployed)

| Metric | Type | Description |
| --- | --- | --- |
| tenant_weka_* | various | Weka filesystem metrics (node-local subset). Available on clusters with Weka storage. |

Kubernetes state

| Metric | Type | Description |
| --- | --- | --- |
| tenant_kube_node_status_condition | gauge | Node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable) |
| tenant_kube_node_spec_unschedulable | gauge | 1 if node is cordoned |
| kube_pod_info | gauge | Pod metadata (node, namespace, pod name). Useful for joins. |
| kube_pod_status_phase | gauge | Pod phase for workloads represented in monitoring (typically running jobs) |
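
As an illustration of the kind of join kube_pod_info enables, the Python sketch below attaches a node name to per-pod samples (for instance container metrics) by matching on namespace and pod. It is a sketch only: the function name and the (labels, value) tuple shape are hypothetical, and the label keys `namespace`, `pod`, and `node` are assumed from the description in the table.

```python
def attach_node(pod_info_labels, samples):
    """Add a node label to each sample whose pod appears in kube_pod_info.

    pod_info_labels: iterable of label dicts from kube_pod_info.
    samples: iterable of (labels, value) pairs carrying namespace and pod.
    Samples with no matching pod are dropped, like an inner join.
    """
    node_by_pod = {
        (labels["namespace"], labels["pod"]): labels["node"]
        for labels in pod_info_labels
    }
    joined = []
    for labels, value in samples:
        node = node_by_pod.get((labels["namespace"], labels["pod"]))
        if node is None:
            continue
        joined.append(({**labels, "node": node}, value))
    return joined
```

Dropping unmatched samples is a design choice of this sketch; keeping them with a placeholder node would be the outer-join alternative.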

Beyond the default dashboard set

Standard dashboards emphasize GPU health, host pressure, Slurm placement, containers, InfiniBand, and Kubernetes state. If you need per-core CPU metrics, Kubernetes control-plane detail, or container network and disk statistics, contact Andromeda Support to discuss expanded coverage.