

Andromeda monitors clusters continuously and routes alerts to your team’s designated Slack channel. This page lists alert rules, meanings, routing behavior, and expected follow-up.
Figure: Grafana memory dashboard with panels for memory usage, OOM kills, huge pages, EDAC memory errors, swap usage, and memory pressure.

On this page:

- Alert tiers: Severity levels and examples.
- GPU alerts: Thermal, ECC, NVLink, PCIe, row remap, and XID signals.
- Node alerts: Readiness, reachability, cordon, drain, and pressure states.
- Host alerts: CPU, memory, I/O, network, disk, and temperature pressure.
- Telemetry health: Metrics and logs freshness checks.
- Expected chains: Follow-on signals that should appear after hardware faults.

Alert tiers

| Tier | Severity | Meaning | Example |
|------|----------|---------|---------|
| t0 | critical | Cluster-wide emergency. Large-scale node loss or capacity collapse. | ClusterMassLoss, ClusterNodeCapacityCritical |
| t1 | critical | High-impact job or capacity disruption. Immediate investigation. | TenantNodeCapacityCritical, LargeSlurmJobNodeUnhealthy, GpuCriticalXid |
| t2 | warning | Degradation trending. Not an emergency but needs attention. | TenantNodeCapacityDegraded, GPUCorrectableEccErrorsHigh |
| t3 | info | Individual node state changes. Informational. | NodeCordoned, SlurmNodeDrained |

Alert types

| Type | Behavior |
|------|----------|
| edge | Fires once on state transition. No repeat. Used for hardware faults. |
| state | Fires while the condition persists. Repeats at long intervals (24h for t0/t1). |
| sustained | Fires after a condition persists beyond a `for` duration, such as 10m or 15m. |
| aggregate | Fleet-level threshold, such as >=50% of nodes lost. |
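
The `sustained` type maps naturally onto a Prometheus-style `for:` clause, while `edge` rules fire as soon as their expression becomes true. The sketch below is illustrative only, assuming Prometheus-style rule definitions; it is not Andromeda's actual rule file, and edge deduplication and Slack routing are handled by Andromeda's pipeline rather than by anything shown here.

```yaml
# Illustrative sketch only (not Andromeda's actual rules). Shows how a
# "sustained" rule differs from an "edge" rule in Prometheus-style syntax.
groups:
  - name: example-alert-types
    rules:
      # edge: fires as soon as the expression is true; no "for" delay.
      # (Deduplication of repeat firings is handled outside this rule.)
      - alert: GPUNvlinkErrors
        expr: increase(DCGM_FI_DEV_GPU_NVLINK_ERRORS[15m]) > 0
        labels:
          tier: t2
          severity: warning
      # sustained: the condition must hold for the full "for" duration
      # (here 10m) before the alert fires.
      - alert: HostIOPressureHigh
        expr: rate(node_pressure_io_stalled_seconds_total[1m]) > 0.3
        for: 10m
        labels:
          tier: t2
          severity: warning
```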

Slack behavior

Alerts fire once per state change. Slack does not post a separate “resolved” message for every alert; use the Alerts Overview dashboard in Grafana for current status. When more than 5 events of the same type fire within 60 seconds, they are aggregated into a single Slack message.
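
Routing and grouping are managed for you, so there is nothing to configure. For intuition only, the snippet below shows what grouping similar alerts into a single Slack message looks like in a generic Alertmanager configuration; it is not Andromeda's actual configuration, and the five-event aggregation described above is applied by Andromeda's own pipeline.

```yaml
# For intuition only: generic Alertmanager-style grouping into one Slack
# message. Not Andromeda's configuration; the channel name is a placeholder.
route:
  receiver: team-slack
  group_by: ['alertname', 'cluster']   # batch events of the same type and cluster
  group_wait: 60s                      # collect similar events for up to 60s
receivers:
  - name: team-slack
    slack_configs:
      - channel: '#team-alerts'
        send_resolved: false           # no separate "resolved" message
```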

Cluster and capacity alerts

| Alert | Tier | Type | Condition | Meaning |
|-------|------|------|-----------|---------|
| ClusterMassLoss | t0 | aggregate | >=50% of GPU nodes gone AND >=5 missing, 2m | Catastrophic cluster failure. |
| ClusterNodeCapacityCritical | t0 | aggregate | >=10% of nodes bad AND >=5, 2m | Significant cluster degradation. |
| ClusterNodeCapacityDegraded | t2 | sustained | >=5% bad AND >=3, 15m | Cluster health trending down. |
| TenantNodeCapacityCritical | t1 | state | 0 ready nodes with >0 assigned, 5m | All assigned nodes unavailable. |
| TenantNodeCapacityDegraded | t2 | sustained | >=10% bad AND >=2, 10m | Part of your assigned capacity is degraded. |
| SlurmCapacityMismatch | t3 | sustained | Assigned != accounted, 15m | Slurm and Kubernetes disagree on node count. Review in Grafana (informational; not posted to Slack). |
| NodeCountDrift | t3 | sustained | Below 7-day peak, 1h | Node count dropped. Review in Grafana (informational; not posted to Slack). |
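
As a rough illustration of how an aggregate, sustained capacity rule can be expressed, the sketch below combines a percentage threshold with an absolute count, assuming kube-state-metrics series and approximating "bad" as NotReady; the actual ClusterNodeCapacityDegraded rule and its definition of a bad node may differ.

```yaml
# Hypothetical sketch in the shape of ClusterNodeCapacityDegraded
# (>=5% bad AND >=3 bad, sustained for 15m). Not the actual rule.
groups:
  - name: example-capacity-rules
    rules:
      - alert: ClusterNodeCapacityDegraded
        expr: |
          (
            count(kube_node_status_condition{condition="Ready", status="true"} == 0)
              / count(kube_node_info)
          ) >= 0.05
          and
          count(kube_node_status_condition{condition="Ready", status="true"} == 0) >= 3
        for: 15m
        labels:
          tier: t2
          severity: warning
```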

GPU alerts

| Alert | Tier | Type | Condition | Meaning |
|-------|------|------|-----------|---------|
| GPUThermalThrottle | t2 | edge, 1h dedup | GPU temp > 83 °C | GPU is thermal throttling. Performance degraded. |
| GPUMemoryThermal | t2 | edge, 1h dedup | HBM temp > 95 °C | HBM overheating. Severe performance impact. |
| GPUNvlinkErrors | t2 | edge | `increase(DCGM_FI_DEV_GPU_NVLINK_ERRORS[15m]) > 0` | NVLink degradation. Affects multi-GPU jobs. |
| GPUUncorrectableEccErrors | t1 | edge | `increase(DCGM_FI_DEV_ECC_DBE_*[15m]) > 0` | Uncorrectable memory error. GPU will likely be drained. |
| GPUCorrectableEccErrorsHigh | t2 | edge | `increase(DCGM_FI_DEV_ECC_SBE_*[1h]) > 10` | High rate of correctable errors. GPU degrading. |
| GPURowRemapFailure | t1 | edge | `DCGM_FI_DEV_ROW_REMAP_FAILURE > 0` | Row remapping failed. GPU needs replacement. |
| GPUCorrectableRemappedRows | t3 | edge | Row remapped due to SBE | Correctable error triggered a row remap. Informational. |
| GPUUncorrectableRemappedRows | t2 | edge | Row remapped due to DBE | Uncorrectable error triggered a row remap. |
| GPUPcieReplayErrorsHigh | t2 | edge | `increase(DCGM_FI_DEV_PCIE_REPLAY_COUNTER[15m]) > 50` | PCIe link instability. |
| GPUAllocatableDegraded | t2 | state | GPU count dropped below 8 per node | Node has fewer GPUs than expected. |
| GpuCriticalXid | t1 | edge | Critical XID codes from host diagnostics | Hardware GPU fault. |
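
If you want to look at the raw signals behind a firing GPU alert, you can query the same DCGM series ad hoc in Grafana Explore. A few illustrative queries follow; the `instance` label and the specific ECC counter variant are assumptions, so adjust them to the labels your dashboards use.

```promql
# Ad-hoc checks against the same DCGM series the GPU alerts use.
# "node-123" is a placeholder; the SBE counter shown is one common variant
# of the DCGM_FI_DEV_ECC_SBE_* family.
DCGM_FI_DEV_GPU_TEMP{instance="node-123"}                          # current GPU temperature (°C)
increase(DCGM_FI_DEV_ECC_SBE_VOL_TOTAL{instance="node-123"}[1h])   # correctable ECC errors, last hour
DCGM_FI_DEV_ROW_REMAP_FAILURE{instance="node-123"}                 # non-zero means row remapping failed
```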

Node alerts

| Alert | Tier | Type | Condition | Meaning |
|-------|------|------|-----------|---------|
| NodeNotReady | t2 | state | `kube_node_status_condition{Ready}==0` | Kubernetes considers the node unhealthy. |
| NodeUnreachable | t1 | edge, 30m dedup | Node unreachable | Node is completely unreachable. |
| NodeCordoned | t3 | state, 2m | `kube_node_spec_unschedulable==1` | Node marked unschedulable. |
| SlurmNodeDrained | t3 | state | Slurm drain state set | Node pulled from Slurm scheduling. |
| SlurmNodeNotResponding | t2 | state | Slurm node not responding | Slurm cannot reach the node. |
| NodeNetworkUnavailable | t2 | state | NetworkUnavailable condition | Node networking is down. |
| NodePIDPressure | t2 | state | PIDPressure condition | Running out of process IDs. |
| NodeDiskPressure | t2 | state | DiskPressure condition | Disk space critically low. |
| NodeMemoryPressure | t2 | state | MemoryPressure condition | System memory exhausted. |
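
Several of the node alerts above map directly onto standard Kubernetes node conditions, so you can check a node's current state with the same kube-state-metrics series the alerts use. For example (`node-123` is a placeholder):

```promql
# Current node state via kube-state-metrics; "node-123" is a placeholder.
kube_node_status_condition{node="node-123", condition="Ready", status="true"}            # 1 = Ready
kube_node_spec_unschedulable{node="node-123"}                                             # 1 = cordoned
kube_node_status_condition{node="node-123", condition="MemoryPressure", status="true"}   # 1 = memory pressure
```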

Host alerts

| Alert | Tier | Type | Condition | Meaning |
|-------|------|------|-----------|---------|
| HostMemoryPressureHigh | t2 | sustained, 10m | PSI memory waiting >0.5 OR stalled >0.2 | Memory thrashing. Jobs will slow down. |
| HostCPUPressureHigh | t2 | sustained, 10m | PSI CPU waiting >0.5 | Severe CPU contention. Data loading bottleneck likely. |
| HostIOPressureHigh | t2 | sustained, 10m | PSI I/O stalled >0.3 | Storage I/O blocking processes. |
| HostNetworkErrorsHigh | t2 | edge, 6h dedup | Network errors >10/s | Persistent network errors. |
| HostNodeOvertemperatureAlarm | t1 | edge | Hardware thermal sensor alarm | Physical hardware overheating. |
| NodeDiskFillingUp | t2 | sustained | Predicted to fill within window | Disk will fill at current rate. |
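
The PSI thresholds above are stall fractions: the share of wall-clock time during which tasks were stalled on a resource. Assuming the values come from node_exporter's pressure collector, you can inspect them directly, for example:

```promql
# PSI stall fractions over the last minute (node_exporter pressure metrics).
# Values near 1.0 mean tasks are stalled almost all the time; "node-123" is a placeholder.
rate(node_pressure_memory_waiting_seconds_total{instance="node-123"}[1m])
rate(node_pressure_memory_stalled_seconds_total{instance="node-123"}[1m])
rate(node_pressure_cpu_waiting_seconds_total{instance="node-123"}[1m])
rate(node_pressure_io_stalled_seconds_total{instance="node-123"}[1m])
```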

Log-derived alerts

These alerts evaluate host log signals in addition to metrics:

| Alert | Tier | Type | Catches |
|-------|------|------|---------|
| GpuCriticalXid | t1 | edge | 46 known critical GPU XID error codes |
| NvlinkError | t1 | edge | NVLink errors in host diagnostics |
| Mlx5DriverError | t1 | edge | Mellanox NIC driver errors |
| LinkDown | t1 | edge | Network link down events |
| MachineCheckException | t1 | edge | CPU machine check exceptions (hardware fault) |

Telemetry health

| Alert | Tier | Condition | Meaning |
|-------|------|-----------|---------|
| ClusterMetricsStale | t2 | No metrics from the cluster for 5+ min | Metrics are stale for this cluster. Refresh Grafana; contact support if it persists. |
| ClusterLogsStale | t2 | No logs for 15+ min | Host logs are stale for this cluster. Retry log workflows; contact support if it persists. |

Alert routing

When an alert fires, it is posted to your designated Slack channel. Related signals on the same node may be grouped so your channel stays readable.

Expected alert chains

Hardware faults typically produce a sequence of alerts:

| Root cause | Expected consequence | Within |
|------------|----------------------|--------|
| GpuCriticalXid | SlurmNodeDrained + NodeCordoned | ~2 min |
| NodeUnreachable | LargeSlurmJobNodeUnhealthy | ~2 min |
| GPUUncorrectableEccErrors | SlurmNodeDrained | ~2 min |
| MachineCheckException | SlurmNodeDrained | ~2 min |
If a root cause fires but the expected consequence does not appear within the expected window, escalate to Andromeda Support.
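
Before escalating, you can confirm whether the expected follow-on alert is actually firing by querying Prometheus's built-in ALERTS series in Grafana Explore. For example (the `node` label name is an assumption and `node-123` is a placeholder):

```promql
# Is the expected consequence currently firing for this node?
ALERTS{alertname="SlurmNodeDrained", alertstate="firing", node="node-123"}
```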