All dashboards are pre-filtered to your assigned nodes and namespaces. Use the cluster and node dropdowns at the top of each dashboard to narrow scope.

GPU Nodes

The primary hardware health view. Start here when investigating GPU issues.
Screenshot: Grafana node drilldown overview showing the selected node, GPU type, GPU count, active alerts, and allocation.
Screenshot: Grafana system dashboard panels showing CPU usage, memory usage, load average, CPU frequency, context switches, processes, swap usage, and system pressure.
| Panel | Metric | Interpretation |
| --- | --- | --- |
| GPU Utilization | DCGM_FI_DEV_GPU_UTIL | Sustained low utilization during active training suggests a bottleneck elsewhere, such as data loading, CPU, or network. |
| GPU Memory Used | DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE | OOM risk when used approaches total framebuffer. |
| GPU Temperature | DCGM_FI_DEV_GPU_TEMP | Throttling starts above 83 °C. |
| GPU Memory Temperature | DCGM_FI_DEV_MEMORY_TEMP | HBM thermal throttling above 95 °C. |
| GPU Power | DCGM_FI_DEV_POWER_USAGE | Compare against TDP. Sustained low power plus low utilization means idle GPUs. |
| ECC Errors (uncorrectable) | DCGM_FI_DEV_ECC_DBE_AGG_TOTAL | Any nonzero value indicates a faulty GPU. Triggers alerts. |
| ECC Errors (correctable) | DCGM_FI_DEV_ECC_SBE_AGG_TOTAL | Normal at low rates. Burst rates above 10/hr indicate degradation. |
| Row Remapping | DCGM_FI_DEV_ROW_REMAP_FAILURE | Nonzero means remapping failed and the GPU needs replacement. |
| XID Errors | DCGM_FI_DEV_XID_ERRORS | The XID code identifies the fault type. See Troubleshooting. |
| NVLink Errors | DCGM_FI_DEV_GPU_NVLINK_ERRORS | Any increase indicates inter-GPU link degradation. |
| PCIe Replay | DCGM_FI_DEV_PCIE_REPLAY_COUNTER | More than 50 in 15 minutes suggests PCIe link instability. |
| Node CPU | tenant_node_cpu_* (aggregated) | Per-NUMA CPU utilization. Data loading bottlenecks surface here. |
| Node Memory | tenant_node_memory_MemAvailable_bytes | Low available memory can cause OOM kills. |
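A couple of hedged PromQL sketches for the panels above, assuming the standard DCGM exporter label names (Hostname, gpu), which may differ in your deployment:

```promql
# Any aggregate uncorrectable ECC errors on a GPU (nonzero = faulty GPU, per the table above).
max by (Hostname, gpu) (DCGM_FI_DEV_ECC_DBE_AGG_TOTAL) > 0

# PCIe replay growth over a 15-minute window; more than 50 suggests link instability.
increase(DCGM_FI_DEV_PCIE_REPLAY_COUNTER[15m]) > 50
```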

Job Analysis

Maps Slurm jobs to nodes and GPUs.
Screenshot: Grafana Slurm operations summary with scheduler and workload status panels.
Screenshot: Grafana Slurm status timeline showing the SlurmRunning state over time.
| Panel | Metric | Interpretation |
| --- | --- | --- |
| Job State | slurm_job_state | Running, Pending, Failed, Completing. Stuck Pending jobs may indicate scheduling issues. |
| GPUs Allocated | slurm_job_gpus_allocated | Per-node GPU count for the job. |
| CPUs Allocated | slurm_job_cpus_allocated | Per-node CPU count. |
| Allocated Nodes | slurm_job_cpus_allocated by node label | Which Slurm nodes the job landed on. |
slurm_job_state does not carry a node label. For per-node detail, use slurm_job_cpus_allocated or slurm_job_gpus_allocated, which do include node labels, as in the query sketch below.
Slurm node names such as h200-reserved-145-019 differ from Kubernetes hostnames such as andromeda25-wk45. The dashboards handle this join automatically. For custom queries, see Troubleshooting.
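For example, a minimal PromQL sketch for per-node allocation, assuming the exporter exposes job_id and node labels (verify the exact label names and job ID against your environment):

```promql
# GPUs allocated to job 123456 on each Slurm node (the job ID and label names are illustrative).
sum by (node) (slurm_job_gpus_allocated{job_id="123456"})
```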

Tenant Dashboard

Capacity and readiness overview.
Screenshot: Grafana reserved cluster table showing clusters, node count, GPU count, network receive rate, and GPU utilization.
| Panel | Metric | Interpretation |
| --- | --- | --- |
| Nodes Assigned | tenant:slurm_nodes:assigned | Total nodes assigned to your environment. |
| Nodes Ready | tenant:slurm_nodes:ready | Nodes that are schedulable and healthy. |
| Nodes Degraded | tenant:slurm_nodes:bad | Nodes that are drained, not-ready, or degraded. |
| Node Readiness Ratio | ready / assigned | Below 100% means some capacity is unavailable. |
When tenant:slurm_nodes:bad is nonzero, open the GPU Nodes dashboard to identify which specific nodes are affected and why, such as ECC errors, thermal state, cordon, or drain.
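Using the recording rules above, a readiness check might look like the following sketch (the dashboard's own panel queries may differ):

```promql
# Fraction of assigned nodes that are ready; below 1 means some capacity is unavailable.
tenant:slurm_nodes:ready / tenant:slurm_nodes:assigned

# Alert-style condition: any degraded (drained, not-ready) nodes present.
tenant:slurm_nodes:bad > 0
```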
Screenshot: Grafana table showing GPU reservations by active node with current, historical, and delta columns.
Screenshot: Grafana Weka storage panels showing throughput, IOPS, pending I/Os, and latency.