Use this page to map common GPU, node, Slurm, InfiniBand, and container symptoms to scoped dashboard checks and queries. Replace $cluster and $node with actual values or use dashboard template variables.

  • Low GPU utilization: compare GPU utilization, power, CPU pressure, I/O pressure, and memory pressure.
  • GPU throttling: check GPU temperature, HBM temperature, and thermal violation counters.
  • ECC and XID errors: identify uncorrectable errors, row remap failures, and critical XID codes.
  • InfiniBand bandwidth: review throughput, transmit wait, link rate, and fabric congestion discards.
  • Slurm node mapping: map Slurm node names to Kubernetes hostnames.
  • Escalation: collect support context and identify immediate escalation triggers.

Check low GPU utilization

Determine whether GPUs are idle or bottlenecked elsewhere.
Dashboards: Grafana CPU utilization and CPU pressure panels for cluster nodes and workloads; CPU panels showing CPU usage by bin, CPU temperature, load average, context switches, CPU throttling, processes, and CPU pressure.
DCGM_FI_DEV_GPU_UTIL{cluster="$cluster"}
DCGM_FI_DEV_POWER_USAGE{cluster="$cluster", node="$node"}
Idle GPUs draw ~70-100W; active H100s draw 500-700W.
rate(tenant_node_pressure_cpu_waiting_seconds_total{cluster="$cluster", node="$node"}[5m])
rate(tenant_node_pressure_io_stalled_seconds_total{cluster="$cluster", node="$node"}[5m])
rate(tenant_node_pressure_memory_stalled_seconds_total{cluster="$cluster", node="$node"}[5m])
If GPU utilization is low but power draw is near TDP and PSI metrics are clean, the bottleneck is likely application-side: synchronization, gradient accumulation, or communication overlap.
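A single query can surface this pattern across the cluster. This is a sketch, not a documented check: the 20% utilization and 400 W cutoffs are illustrative, and it assumes both DCGM series carry the node label used above.
# illustrative cutoffs: low reported utilization while power is well above idle
avg by (node) (DCGM_FI_DEV_GPU_UTIL{cluster="$cluster"}) < 20
  and on (node)
avg by (node) (DCGM_FI_DEV_POWER_USAGE{cluster="$cluster"}) > 400
Nodes returned by this expression are drawing near-active power while reporting low utilization, which points to the application-side causes above rather than idle hardware.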

Check GPU throttling

DCGM_FI_DEV_GPU_TEMP{cluster="$cluster", node="$node"}
Throttle threshold: 83 °C.
DCGM_FI_DEV_MEMORY_TEMP{cluster="$cluster", node="$node"}
HBM throttle threshold: 95 °C.
rate(DCGM_FI_DEV_THERMAL_VIOLATION{cluster="$cluster", node="$node"}[5m])
Sustained thermal throttling reduces clock speeds and power budget. Report the node if this persists; it may indicate a cooling failure.
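To flag nodes at or past these thresholds in one pass, a sketch (it assumes the DCGM temperature series carry a node label, as in the queries above):
# hottest GPU die or HBM stack per node versus its throttle threshold
max by (node) (DCGM_FI_DEV_GPU_TEMP{cluster="$cluster"}) > 83
  or
max by (node) (DCGM_FI_DEV_MEMORY_TEMP{cluster="$cluster"}) > 95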

Check ECC errors

DCGM_FI_DEV_ECC_DBE_AGG_TOTAL{cluster="$cluster"} > 0
Any nonzero uncorrectable (DBE) value indicates a faulty GPU.
increase(DCGM_FI_DEV_ECC_SBE_AGG_TOTAL{cluster="$cluster"}[1h])
Correctable error rate >10/hr is concerning.
DCGM_FI_DEV_ROW_REMAP_FAILURE{cluster="$cluster"} > 0
Row remap failure means the GPU needs replacement. Uncorrectable errors trigger the GPUUncorrectableEccErrors alert. If DBE errors are present and the node is still scheduling jobs, escalate immediately.
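To list only the GPUs exceeding the correctable-error guideline, a sketch using the 10/hr figure above:
increase(DCGM_FI_DEV_ECC_SBE_AGG_TOTAL{cluster="$cluster"}[1h]) > 10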

XID errors

XID errors are GPU fault codes from the NVIDIA driver. Common critical XIDs:
| XID | Meaning | Impact |
| --- | --- | --- |
| 31 | GPU memory page fault | Job crash |
| 43 | GPU stopped processing | Job hang or crash |
| 45 | Preemptive cleanup (double-bit ECC) | GPU pulled from service |
| 48 | Double-bit ECC error | GPU needs replacement |
| 63 | ECC page retirement / row remap | Degraded but functional |
| 64 | ECC page retirement (DBE) | GPU needs replacement |
| 74 | NVLink error | Multi-GPU communication failure |
| 79 | GPU fallen off bus | Node needs reboot |
| 94 | Contained ECC error | Usually recoverable |
| 95 | Uncontained ECC error | GPU needs replacement |
DCGM_FI_DEV_XID_ERRORS{cluster="$cluster"} > 0
increase(DCGM_FI_DEV_GPU_NVLINK_ERRORS{cluster="$cluster"}[15m]) > 0
increase(DCGM_FI_DEV_PCIE_REPLAY_COUNTER{cluster="$cluster"}[15m])
NVLink errors degrade multi-GPU training such as AllReduce and tensor parallelism. PCIe replay errors indicate link instability between GPU and CPU or switch. More than 50 replays in 15 minutes is concerning.
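Applying the replay guideline directly, a sketch that returns only links over the 50-replays-in-15-minutes mark:
increase(DCGM_FI_DEV_PCIE_REPLAY_COUNTER{cluster="$cluster"}[15m]) > 50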

Node CPU and memory

Dashboards: Grafana node drilldown panels showing CPU, process, memory, OOM, HugePages, and EDAC signals; memory panels showing system memory usage, OOM kills, HugePages, memory detail, EDAC memory errors, swap usage, and memory pressure.
1 - avg by (node) (
  rate(tenant_node_cpu_seconds_total:15s_without_cpu_total{cluster="$cluster", mode="idle"}[5m])
)
1 - avg by (node, numa) (
  irate(tenant_node_cpu_seconds_total:15s_without_cpu_total{cluster="$cluster", mode="idle"}[2m])
)
Per-NUMA breakdown helps identify asymmetric CPU load.
tenant_node_memory_MemAvailable_bytes{cluster="$cluster", node="$node"}
increase(tenant_node_vmstat_oom_kill{cluster="$cluster", node="$node"}[1h])
Any OOM kill increase warrants investigation.
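To sweep the whole cluster instead of one node, a sketch (the 16 GiB floor is illustrative, not a documented threshold):
# nodes with any OOM kills in the last hour
increase(tenant_node_vmstat_oom_kill{cluster="$cluster"}[1h]) > 0
# nodes with less than 16 GiB of available memory (illustrative cutoff)
tenant_node_memory_MemAvailable_bytes{cluster="$cluster"} < 16 * 1024^3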

Check InfiniBand bandwidth

Dashboards: Grafana RDMA and InfiniBand panels showing port state and physical state for selected interfaces; per-port throughput, management throughput, congestion transmit wait, TX discards, RX errors, and symbol errors; and an overview of port state, physical state, aggregate throughput, and port utilization.
irate(tenant_node_infiniband_port_data_transmitted_bytes_total{cluster="$cluster", node="$node"}[1m]) * 8 / 1e9
TX bandwidth in Gbps.
irate(tenant_node_infiniband_port_transmit_wait_total{cluster="$cluster", node="$node"}[1m]) * 4 / 1000000
Transmit wait in ms/s. Any value >0 indicates congestion.
tenant_node_infiniband_rate_bytes_per_second{cluster="$cluster", node="$node"}
50 GB/s = NDR/400G link.
rate(tenant_ib_perfquery_xmit_discards_total{cluster="$cluster"}[5m])
Fabric-level congestion discards.
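Dividing throughput by the reported link rate gives per-port utilization as a fraction of line rate. This is a sketch; it assumes the two series share the same node and port labels so the one-to-one match resolves:
irate(tenant_node_infiniband_port_data_transmitted_bytes_total{cluster="$cluster", node="$node"}[1m])
  / tenant_node_infiniband_rate_bytes_per_second{cluster="$cluster", node="$node"}
Values near 1.0 mean the link is running at line rate; sustained saturation together with nonzero transmit wait is consistent with fabric congestion.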

Slurm to Kubernetes node mapping

Slurm uses names like h200-reserved-145-019. Kubernetes uses names like andromeda25-wk45. To map between them:
kube_pod_info{pod=~"h200-reserved-.*", cluster="$cluster"}
The Slurm node name matches the Kubernetes pod name for that compute instance. kube_pod_info provides the node label (Kubernetes hostname).
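For example, to resolve the Slurm node from the example above, query its pod directly and read the node label from the result:
kube_pod_info{cluster="$cluster", pod="h200-reserved-145-019"}
The node label on the returned series is the Kubernetes hostname (for example, andromeda25-wk45).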

Tenant capacity

tenant:slurm_nodes:assigned{cluster="$cluster"}
tenant:slurm_nodes:ready{cluster="$cluster"}
tenant:slurm_nodes:bad{cluster="$cluster"}
tenant:slurm_nodes:ready{cluster="$cluster"} / tenant:slurm_nodes:assigned{cluster="$cluster"}
Readiness ratio should be 1.0.
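To return only series where readiness has slipped below 1.0, a sketch (it assumes the ready and assigned recording rules carry matching labels so the division joins one-to-one):
tenant:slurm_nodes:ready{cluster="$cluster"} / tenant:slurm_nodes:assigned{cluster="$cluster"} < 1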

Container resource usage

Dashboards: Grafana I/O panels showing disk throughput, disk I/O utilization, disk usage, and I/O pressure.
rate(container_cpu_usage_seconds_total{namespace="$namespace", pod="$pod"}[5m])
container_memory_working_set_bytes{namespace="$namespace", pod="$pod"}
Working set is the metric that determines OOM kills.
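To find the heaviest consumers in a namespace rather than checking one pod at a time, a sketch:
# top consumers by CPU and by working-set memory
topk(10, rate(container_cpu_usage_seconds_total{namespace="$namespace"}[5m]))
topk(10, container_memory_working_set_bytes{namespace="$namespace"})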

Prepare a support request

Include the following in any support request:
  1. Cluster name and time range (absolute, not relative)
  2. Affected nodes (Slurm names and/or Kubernetes hostnames)
  3. Job IDs if Slurm jobs are involved
  4. Dashboard link with the time range pinned
  5. Observed behavior vs expected behavior
Escalate immediately if any of the following are true:
  • Uncorrectable ECC errors on a node that is still scheduling
  • tenant:slurm_nodes:ready at 0 with nonzero assigned (a query sketch for this check follows the list)
  • XID 79 (GPU fallen off bus) on any node
  • An alert chain where the expected consequence did not fire. See Alerts.
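The second trigger can be checked directly, as a sketch (it assumes the two recording rules share labels so the and join matches):
tenant:slurm_nodes:ready{cluster="$cluster"} == 0
  and
tenant:slurm_nodes:assigned{cluster="$cluster"} > 0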