All metrics listed below are available through your Grafana dashboards. Queries are automatically scoped to your assigned nodes and namespaces.
- GPU: DCGM utilization, memory, thermal, ECC, NVLink, PCIe, and XID metrics.
- Node: Aggregated CPU, memory, load, pressure, thermals, and EDAC metrics.
- Container: CPU and memory metrics scoped to pods in your namespace.
- Slurm: Job state, GPU allocation, CPU allocation, and assigned-capacity rules.
- InfiniBand: Port counters, link rate, transmit wait, and fabric-level congestion metrics.
- Kubernetes state: Node conditions, cordon state, pod metadata, and retained pod phase metrics.
In each table, use Metric for dashboard or query matching, Unit for panel scale, and Description to decide whether the signal applies to your workload.
GPU (DCGM)
Labels: cluster, namespace, pod, node, gpu (index), modelName, Hostname.

| Metric | Type | Unit | Description |
|---|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | gauge | % | GPU SM utilization (0-100) |
| DCGM_FI_DEV_MEM_COPY_UTIL | gauge | % | Memory controller utilization |
| DCGM_FI_DEV_GPU_TEMP | gauge | °C | GPU die temperature |
| DCGM_FI_DEV_MEMORY_TEMP | gauge | °C | HBM memory temperature |
| DCGM_FI_DEV_POWER_USAGE | gauge | W | Current power draw |
| DCGM_FI_DEV_FB_FREE | gauge | MiB | Free framebuffer memory |
| DCGM_FI_DEV_FB_USED | gauge | MiB | Used framebuffer memory |
| DCGM_FI_DEV_XID_ERRORS | gauge | | Last XID error code (0 = none) |
| DCGM_FI_DEV_ECC_SBE_AGG_TOTAL | counter | | Correctable ECC errors (aggregate lifetime) |
| DCGM_FI_DEV_ECC_DBE_AGG_TOTAL | counter | | Uncorrectable ECC errors (aggregate lifetime) |
| DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | counter | | Correctable ECC errors (since last reset) |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | counter | | Uncorrectable ECC errors (since last reset) |
| DCGM_FI_DEV_GPU_NVLINK_ERRORS | counter | | NVLink error count |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | counter | | PCIe replay count |
| DCGM_FI_DEV_ROW_REMAP_FAILURE | gauge | | Row remap failure (0 = ok, >0 = failed) |
| DCGM_FI_DEV_RETIRED_SBE | counter | | Pages retired due to correctable errors |
| DCGM_FI_DEV_RETIRED_DBE | counter | | Pages retired due to uncorrectable errors |
| DCGM_FI_DEV_THERMAL_VIOLATION | counter | µs | Microseconds spent thermal throttling |
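Because DCGM_FI_DEV_THERMAL_VIOLATION is a counter in microseconds, a single reading says little; the change over a time window is usually the useful signal. The Python sketch below shows the arithmetic for turning two samples into a throttling fraction, and for deriving a framebuffer usage percentage from DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE. The function names are hypothetical, and treating used + free as the total is an assumption that may undercount if your GPUs reserve part of the framebuffer.

```python
def counter_increase(start: float, end: float) -> float:
    """Increase of a counter between two samples.

    If the value went backwards, assume the counter was reset to zero
    somewhere in between and count only what accumulated since.
    """
    return end - start if end >= start else end


def throttle_fraction(violation_us_start: float,
                      violation_us_end: float,
                      window_s: float) -> float:
    """Fraction of the window spent thermal throttling (0.0 = never)."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    throttled_s = counter_increase(violation_us_start, violation_us_end) / 1_000_000
    return throttled_s / window_s


def fb_used_percent(fb_used_mib: float, fb_free_mib: float) -> float:
    """Used framebuffer as a percentage of used + free (assumed total)."""
    total = fb_used_mib + fb_free_mib
    return 100.0 * fb_used_mib / total if total > 0 else 0.0


if __name__ == "__main__":
    # 3 s of throttling accumulated over a 60 s window.
    print(throttle_fraction(1_000_000, 4_000_000, 60))
    print(fb_used_percent(60_000, 20_000))
```

In a dashboard the same calculation would normally be done in the query itself; the sketch only makes the unit conversion explicit.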
SM profiling metrics (DCGM_FI_PROF_SM_ACTIVE, DCGM_FI_PROF_SM_OCCUPANCY) and volatile ECC counters appear when they are enabled for your cluster. If a panel is empty after you confirm filters, ask Andromeda Support whether those signals are available for your environment.

Node (aggregated)
CPU metrics are summarized per host (and often per NUMA group) so charts stay readable, rather than listing every logical CPU separately.

| Metric | Type | Unit | Description |
|---|---|---|---|
| tenant_node_cpu_seconds_total:15s_without_cpu_total | counter | s | CPU time by mode (idle, user, system, iowait, etc.), aggregated into 8-core bins |
| tenant_node_cpu_scaling_frequency_hertz:15s_without_cpu_avg | gauge | Hz | Average CPU frequency per NUMA node |
| tenant_node_memory_MemTotal_bytes | gauge | B | Total RAM |
| tenant_node_memory_MemAvailable_bytes | gauge | B | Available RAM |
| tenant_node_memory_MemFree_bytes | gauge | B | Free RAM |
| tenant_node_load1 | gauge | | 1-minute load average |
| tenant_node_load5 | gauge | | 5-minute load average |
| tenant_node_load15 | gauge | | 15-minute load average |
| tenant_node_vmstat_oom_kill | counter | | OOM kill count |
| tenant_node_pressure_cpu_waiting_seconds_total | counter | s | PSI: CPU pressure (waiting) |
| tenant_node_pressure_memory_waiting_seconds_total | counter | s | PSI: Memory pressure (waiting) |
| tenant_node_pressure_memory_stalled_seconds_total | counter | s | PSI: Memory pressure (stalled) |
| tenant_node_pressure_io_stalled_seconds_total | counter | s | PSI: I/O pressure (stalled) |
| tenant_node_hwmon_temp_celsius | gauge | °C | Hardware sensor temperatures |
| tenant_node_edac_correctable_errors_total | counter | | EDAC correctable memory errors |
| tenant_node_edac_uncorrectable_errors_total | counter | | EDAC uncorrectable errors |
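As a rough illustration of how the node metrics combine, the Python sketch below derives a memory usage percentage from tenant_node_memory_MemTotal_bytes and tenant_node_memory_MemAvailable_bytes, and a CPU busy fraction from per-mode increases of the CPU seconds counter. The function names are hypothetical, and counting every mode except idle as busy is one possible convention, not something this page prescribes.

```python
from typing import Mapping


def memory_used_percent(mem_total_bytes: float,
                        mem_available_bytes: float) -> float:
    """Share of RAM that is not available, as a percentage."""
    if mem_total_bytes <= 0:
        raise ValueError("mem_total_bytes must be positive")
    return 100.0 * (1.0 - mem_available_bytes / mem_total_bytes)


def cpu_busy_fraction(mode_seconds: Mapping[str, float]) -> float:
    """Busy fraction from per-mode CPU-second increases over one window.

    mode_seconds maps a mode name (idle, user, system, iowait, ...) to
    the increase of the counter for that mode during the window.
    Assumption: everything except "idle" counts as busy.
    """
    total = sum(mode_seconds.values())
    if total <= 0:
        return 0.0
    return 1.0 - mode_seconds.get("idle", 0.0) / total


if __name__ == "__main__":
    print(memory_used_percent(1000, 250))
    print(cpu_busy_fraction({"idle": 12.0, "user": 2.0,
                             "system": 1.0, "iowait": 1.0}))
```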
Container
Scoped to pods in your namespace.

| Metric | Type | Unit | Description |
|---|---|---|---|
| container_cpu_usage_seconds_total | counter | s | Cumulative CPU time consumed |
| container_memory_working_set_bytes | gauge | B | Current working set (determines OOM kills) |
| container_memory_rss | gauge | B | Resident set size |
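container_cpu_usage_seconds_total only becomes readable once consecutive samples are turned into a rate. The sketch below, in Python, converts a series of (timestamp, value) samples into average cores used per interval. The sample layout and the function name are assumptions made for illustration; they do not describe how the dashboards store data.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def cores_used(samples: List[Sample]) -> List[Sample]:
    """Average CPU cores used between each pair of consecutive samples.

    Samples must be sorted by timestamp. A decrease in the counter is
    treated as a restart from zero.
    """
    result = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        dv = v1 - v0 if v1 >= v0 else v1
        result.append((t1, dv / dt))
    return result


if __name__ == "__main__":
    # 30 CPU-seconds consumed every 15 s: two cores busy on average.
    print(cores_used([(0, 10.0), (15, 40.0), (30, 70.0)]))
```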
Slurm
| Metric | Type | Labels | Description |
|---|---|---|---|
| slurm_job_state | gauge | job_id, job_name, user, partition, state | Job state (Running=1, Pending=2, etc.). No node label. |
| slurm_job_gpus_allocated | gauge | job_id, node | GPUs allocated to job on each node |
| slurm_job_cpus_allocated | gauge | job_id, node | CPUs allocated to job on each node |

Pre-computed recording rules

| Metric | Description |
|---|---|
| tenant:slurm_nodes:assigned | Total nodes assigned to your environment |
| tenant:slurm_nodes:ready | Healthy, schedulable nodes |
| tenant:slurm_nodes:bad | Drained, not-ready, or degraded nodes |
| slurm_job:allocated_node_info | Join of job allocation with node metadata |
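Since slurm_job_state carries no node label, combining it with the per-node allocation metrics means matching on job_id. The Python sketch below illustrates that join on in-memory samples: it sums slurm_job_gpus_allocated across nodes and keeps only jobs whose state value is 1 (Running, per the table above). Representing each sample as a (labels, value) pair and the function names are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, value)

RUNNING = 1  # state value for a running job, as listed in the table


def gpus_per_job(gpu_samples: Iterable[Sample]) -> Dict[str, float]:
    """Total GPUs per job_id, summed over every node the job runs on."""
    totals: Dict[str, float] = defaultdict(float)
    for labels, value in gpu_samples:
        totals[labels["job_id"]] += value
    return dict(totals)


def running_job_gpus(state_samples: Iterable[Sample],
                     gpu_samples: Iterable[Sample]) -> Dict[str, float]:
    """GPU totals restricted to jobs that are currently running."""
    running = {labels["job_id"] for labels, value in state_samples
               if value == RUNNING}
    return {job: gpus for job, gpus in gpus_per_job(gpu_samples).items()
            if job in running}


if __name__ == "__main__":
    states = [({"job_id": "101"}, 1), ({"job_id": "102"}, 2)]
    gpus = [({"job_id": "101", "node": "n1"}, 8),
            ({"job_id": "101", "node": "n2"}, 8),
            ({"job_id": "102", "node": "n3"}, 4)]
    print(running_job_gpus(states, gpus))
```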
InfiniBand
Available where InfiniBand is deployed, not on RoCE clusters.

| Metric | Type | Unit | Description |
|---|---|---|---|
| tenant_node_infiniband_port_data_transmitted_bytes_total | counter | B | TX bytes per IB port |
| tenant_node_infiniband_port_data_received_bytes_total | counter | B | RX bytes per IB port |
| tenant_node_infiniband_port_transmit_wait_total | counter | ticks | Backpressure ticks (4 ns each) |
| tenant_node_infiniband_rate_bytes_per_second | gauge | B/s | Link rate (50 GB/s = NDR/400G) |
| tenant_ib_perfquery_xmit_data_bytes_total | counter | B | Fabric-level TX bytes (64-bit counters) |
| tenant_ib_perfquery_rcv_data_bytes_total | counter | B | Fabric-level RX bytes (64-bit counters) |
| tenant_ib_perfquery_xmit_wait_total | counter | ticks | Fabric-level backpressure |
| tenant_ib_perfquery_xmit_discards_total | counter | | Fabric-level congestion discards |
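The units in this table invite a little arithmetic: transmit-wait ticks are 4 ns each, and a link rate of 50 GB/s corresponds to 400 Gbit/s. The Python sketch below spells out those conversions and a simple link utilization ratio. The function names are hypothetical, and reading wait time divided by window length as a "backpressure fraction" is an interpretive assumption rather than a documented definition.

```python
TICK_SECONDS = 4e-9  # one transmit-wait tick, per the table above


def xmit_wait_fraction(ticks_increase: float, window_s: float) -> float:
    """Share of the window a port spent waiting to transmit."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return ticks_increase * TICK_SECONDS / window_s


def link_utilization(bytes_per_second: float,
                     link_rate_bytes_per_second: float) -> float:
    """Observed throughput as a fraction of the reported link rate."""
    if link_rate_bytes_per_second <= 0:
        raise ValueError("link rate must be positive")
    return bytes_per_second / link_rate_bytes_per_second


def bytes_per_second_to_gbit(rate_bytes_per_second: float) -> float:
    """Convert B/s to Gbit/s (decimal units)."""
    return rate_bytes_per_second * 8 / 1e9


if __name__ == "__main__":
    print(bytes_per_second_to_gbit(50e9))        # 50 GB/s * 8 = 400 Gbit/s
    print(xmit_wait_fraction(250_000_000, 10))   # about 1 s of waiting in 10 s
    print(link_utilization(25e9, 50e9))
```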
Weka (where deployed)
| Metric | Type | Description |
|---|---|---|
| tenant_weka_* | various | Weka filesystem metrics (node-local subset). Available on clusters with Weka storage. |
Kubernetes state
| Metric | Type | Description |
|---|---|---|
| tenant_kube_node_status_condition | gauge | Node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable) |
| tenant_kube_node_spec_unschedulable | gauge | 1 if node is cordoned |
| kube_pod_info | gauge | Pod metadata (node, namespace, pod name). Useful for joins. |
| kube_pod_status_phase | gauge | Pod phase for workloads represented in monitoring (typically running jobs) |
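To make "useful for joins" concrete, the Python sketch below attaches a node label to pod-level samples by matching on namespace and pod, which is the same idea as a many-to-one label join in a dashboard query. It assumes the pod-level samples carry namespace and pod labels and that kube_pod_info exposes node, namespace, and pod; the data layout, names, and example values are illustrative only.

```python
from typing import Dict, Iterable, List, Tuple

Labels = Dict[str, str]
Sample = Tuple[Labels, float]


def attach_node(samples: Iterable[Sample],
                pod_info: Iterable[Labels]) -> List[Sample]:
    """Add a "node" label to each sample whose pod appears in pod_info.

    Samples without a matching (namespace, pod) entry are dropped,
    mirroring how an inner join behaves.
    """
    node_by_pod = {(info["namespace"], info["pod"]): info["node"]
                   for info in pod_info}
    joined = []
    for labels, value in samples:
        node = node_by_pod.get((labels.get("namespace"), labels.get("pod")))
        if node is not None:
            joined.append(({**labels, "node": node}, value))
    return joined


if __name__ == "__main__":
    info = [{"namespace": "team-a", "pod": "train-0", "node": "gpu-node-01"}]
    mem = [({"namespace": "team-a", "pod": "train-0"}, 1024.0),
           ({"namespace": "team-a", "pod": "gone-0"}, 5.0)]
    print(attach_node(mem, info))
```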