Andromeda monitors clusters continuously and routes alerts to your team’s designated Slack channel. This page lists alert rules, meanings, routing behavior, and expected follow-up.
On this page:
- Alert tiers: severity levels and examples.
- GPU alerts: thermal, ECC, NVLink, PCIe, row remap, and XID signals.
- Node alerts: readiness, reachability, cordon, drain, and pressure states.
- Host alerts: CPU, memory, I/O, network, disk, and temperature pressure.
- Telemetry health: metrics and logs freshness checks.
- Expected chains: follow-on signals that should appear after hardware faults.
Alert tiers
| Tier | Severity | Meaning | Example |
|---|---|---|---|
| t0 | critical | Cluster-wide emergency. Large-scale node loss or capacity collapse. | ClusterMassLoss, ClusterNodeCapacityCritical |
| t1 | critical | High-impact job or capacity disruption. Immediate investigation. | TenantNodeCapacityCritical, LargeSlurmJobNodeUnhealthy, GpuCriticalXid |
| t2 | warning | Degradation trending. Not an emergency but needs attention. | TenantNodeCapacityDegraded, GPUCorrectableEccErrorsHigh |
| t3 | info | Individual node state changes. Informational. | NodeCordoned, SlurmNodeDrained |
Alert types
| Type | Behavior |
|---|---|
| edge | Fires once on state transition. No repeat. Used for hardware faults. |
| state | Fires while condition persists. Repeats at long intervals (24h for t0/t1). |
| sustained | Fires only after a condition has persisted for a set duration, such as 10m or 15m (see the sketch below). |
| aggregate | Fleet-level threshold, such as >=50% nodes lost. |
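
A sustained rule maps onto the standard Prometheus pattern of an expression that must hold for the full for: window before firing. The sketch below is illustrative only: the rule name matches the table above, but the tenant_bad_nodes and tenant_assigned_nodes metrics are assumptions, not Andromeda's actual recording rules.

```yaml
# Illustrative sketch of a "sustained" alert: the expression must stay true
# for the entire `for:` window before the alert fires.
# tenant_bad_nodes / tenant_assigned_nodes are hypothetical metric names.
groups:
  - name: example-sustained-rules
    rules:
      - alert: TenantNodeCapacityDegraded
        expr: |
          (tenant_bad_nodes / tenant_assigned_nodes) >= 0.10
          and tenant_bad_nodes >= 2
        for: 10m
        labels:
          tier: t2
          severity: warning
```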
Slack behavior
Alerts fire once per state change. Slack does not post a separate “resolved” message for every alert; use the Alerts Overview dashboard in Grafana for current status.
When more than 5 events of the same type fire within 60 seconds, they are aggregated into a single Slack message.
Cluster and capacity alerts
| Alert | Tier | Type | Condition | Meaning |
|---|---|---|---|---|
| ClusterMassLoss | t0 | aggregate | >=50% GPU nodes gone AND >=5 missing, 2m | Catastrophic cluster failure. |
| ClusterNodeCapacityCritical | t0 | aggregate | >=10% nodes bad AND >=5, 2m | Significant cluster degradation. |
| ClusterNodeCapacityDegraded | t2 | sustained | >=5% bad AND >=3, 15m | Cluster health trending down. |
| TenantNodeCapacityCritical | t1 | state | 0 ready nodes with >0 assigned, 5m | All assigned nodes unavailable. |
| TenantNodeCapacityDegraded | t2 | sustained | >=10% bad AND >=2, 10m | Part of your assigned capacity is degraded. |
| SlurmCapacityMismatch | t3 | sustained | Assigned != accounted, 15m | Slurm and Kubernetes disagree on node count. Review in Grafana (informational; not posted to Slack). |
| NodeCountDrift | t3 | sustained | Below 7-day peak, 1h | Node count dropped. Review in Grafana (informational; not posted to Slack). |
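
For intuition, an aggregate condition such as ClusterMassLoss can be written as a fleet-wide count. A minimal sketch, assuming GPU nodes are scraped under a job="gpu-node" label; the selector and labels are assumptions, not the production rule:

```yaml
# Illustrative only: ">=50% GPU nodes gone AND >=5 missing, for 2m".
# The up{job="gpu-node"} selector is an assumption about scrape labeling.
- alert: ClusterMassLoss
  expr: |
    (count(up{job="gpu-node"} == 0) / count(up{job="gpu-node"})) >= 0.5
    and count(up{job="gpu-node"} == 0) >= 5
  for: 2m
  labels:
    tier: t0
    severity: critical
```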
GPU alerts
| Alert | Tier | Type | Condition | Meaning |
|---|---|---|---|---|
| GPUThermalThrottle | t2 | edge, 1h dedup | GPU temp > 83 °C | GPU is thermal throttling. Performance degraded. |
| GPUMemoryThermal | t2 | edge, 1h dedup | HBM temp > 95 °C | HBM overheating. Severe performance impact. |
| GPUNvlinkErrors | t2 | edge | increase(DCGM_FI_DEV_GPU_NVLINK_ERRORS[15m]) > 0 | NVLink degradation. Affects multi-GPU jobs. |
| GPUUncorrectableEccErrors | t1 | edge | increase(DCGM_FI_DEV_ECC_DBE_*[15m]) > 0 | Uncorrectable memory error. GPU will likely be drained. |
| GPUCorrectableEccErrorsHigh | t2 | edge | increase(DCGM_FI_DEV_ECC_SBE_*[1h]) > 10 | High rate of correctable errors. GPU degrading. |
| GPURowRemapFailure | t1 | edge | DCGM_FI_DEV_ROW_REMAP_FAILURE > 0 | Row remapping failed. GPU needs replacement. |
| GPUCorrectableRemappedRows | t3 | edge | Row remapped due to SBE | Correctable error triggered a row remap. Informational. |
| GPUUncorrectableRemappedRows | t2 | edge | Row remapped due to DBE | Uncorrectable error triggered a row remap. |
| GPUPcieReplayErrorsHigh | t2 | edge | increase(DCGM_FI_DEV_PCIE_REPLAY_COUNTER[15m]) > 50 | PCIe link instability. |
| GPUAllocatableDegraded | t2 | state | GPU count dropped below 8 per node | Node has fewer GPUs than expected. |
| GpuCriticalXid | t1 | edge | Critical XID codes from host diagnostics | Hardware GPU fault. |
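
When a GPU alert fires, the underlying DCGM counter can be inspected directly in Grafana. A hedged example for the uncorrectable-ECC case, assuming dcgm-exporter's standard DCGM_FI_DEV_ECC_DBE_VOL_TOTAL counter (one expansion of the DCGM_FI_DEV_ECC_DBE_* wildcard above); the node label name and node name are assumptions about your series labeling:

```promql
# Uncorrectable (double-bit) ECC error growth per GPU over the last 15m.
# "node" and "gpu-node-017" are placeholders, not Andromeda's actual labels.
increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{node="gpu-node-017"}[15m])
```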
Node alerts
| Alert | Tier | Type | Condition | Meaning |
|---|---|---|---|---|
| NodeNotReady | t2 | state | kube_node_status_condition{condition="Ready", status="true"} == 0 | Kubernetes considers the node unhealthy. |
| NodeUnreachable | t1 | edge, 30m dedup | Node unreachable | Node is completely unreachable. |
| NodeCordoned | t3 | state, 2m | kube_node_spec_unschedulable == 1 | Node marked unschedulable. |
| SlurmNodeDrained | t3 | state | Slurm drain state set | Node pulled from Slurm scheduling. |
| SlurmNodeNotResponding | t2 | state | Slurm node not responding | Slurm cannot reach the node. |
| NodeNetworkUnavailable | t2 | state | NetworkUnavailable condition | Node networking is down. |
| NodePIDPressure | t2 | state | PIDPressure condition | Node is running out of process IDs. |
| NodeDiskPressure | t2 | state | DiskPressure condition | Disk space critically low. |
| NodeMemoryPressure | t2 | state | MemoryPressure condition | System memory exhausted. |
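
The node conditions above come from kube-state-metrics. A minimal sketch of how a NodeNotReady-style rule is commonly written against the standard kube_node_status_condition series; the for: duration and labels are assumptions, not Andromeda's actual definition:

```yaml
# kube-state-metrics exposes one series per (condition, status) pair; the
# Ready/status="true" series drops to 0 when Kubernetes marks the node unhealthy.
- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready", status="true"} == 0
  for: 5m
  labels:
    tier: t2
    severity: warning
```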
Host alerts
| Alert | Tier | Type | Condition | Meaning |
|---|---|---|---|---|
| HostMemoryPressureHigh | t2 | sustained, 10m | PSI memory waiting >0.5 OR stalled >0.2 | Memory thrashing. Jobs will slow down. |
| HostCPUPressureHigh | t2 | sustained, 10m | PSI CPU waiting >0.5 | Severe CPU contention. Data loading bottleneck likely. |
| HostIOPressureHigh | t2 | sustained, 10m | PSI I/O stalled >0.3 | Storage I/O blocking processes. |
| HostNetworkErrorsHigh | t2 | edge, 6h dedup | Network errors >10/s | Persistent network errors. |
| HostNodeOvertemperatureAlarm | t1 | edge | Hardware thermal sensor alarm | Physical hardware overheating. |
| NodeDiskFillingUp | t2 | sustained | Predicted to fill within window | Disk will fill at the current rate. |
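
The PSI thresholds map naturally onto node_exporter's pressure collector. A hedged sketch of HostMemoryPressureHigh's condition; only the 0.5/0.2 thresholds and the 10m window come from the table above, the rest is illustrative:

```yaml
# PSI counters accumulate stall seconds, so rate() over a window yields the
# fraction of wall-clock time processes spent waiting/stalled on memory.
- alert: HostMemoryPressureHigh
  expr: |
    rate(node_pressure_memory_waiting_seconds_total[5m]) > 0.5
    or rate(node_pressure_memory_stalled_seconds_total[5m]) > 0.2
  for: 10m
  labels:
    tier: t2
    severity: warning
```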
Log-derived alerts
These alerts evaluate host log signals in addition to metrics:
| Alert | Tier | Type | Catches |
|---|---|---|---|
| GpuCriticalXid | t1 | edge | 46 known critical GPU XID error codes |
| NvlinkError | t1 | edge | NVLink errors in host diagnostics |
| Mlx5DriverError | t1 | edge | Mellanox NIC driver errors |
| LinkDown | t1 | edge | Network link down events |
| MachineCheckException | t1 | edge | CPU machine check exceptions (hardware fault) |
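
A common pattern for log-derived alerting is to convert matched log lines into a counter and alert on its increase. The sketch below assumes a hypothetical host_log_critical_xid_total counter emitted by the log pipeline; Andromeda's actual mechanism may differ:

```yaml
# host_log_critical_xid_total is a hypothetical counter, assumed to be
# incremented whenever a critical XID code appears in host diagnostics.
- alert: GpuCriticalXid
  expr: increase(host_log_critical_xid_total[5m]) > 0
  labels:
    tier: t1
    severity: critical
```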
Telemetry health
| Alert | Tier | Condition | Meaning |
|---|---|---|---|
| ClusterMetricsStale | t2 | No metrics from cluster for 5+ min | Metrics are stale for this cluster. Refresh Grafana; contact support if it persists. |
| ClusterLogsStale | t2 | No logs for 15+ min | Host logs are stale for this cluster. Retry log workflows; contact support if it persists. |
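
Before contacting support, metrics staleness can be spot-checked in PromQL. A minimal sketch, assuming your series carry a cluster label (the label name and value are assumptions):

```promql
# Seconds since the newest sample arrived from the cluster; > 300 matches the
# 5-minute ClusterMetricsStale window. "cluster" is an assumed label.
time() - max(timestamp(up{cluster="my-cluster"}))
```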
Alert routing
When an alert fires, it is posted to your designated Slack channel. Related signals on the same node may be grouped so your channel stays readable.
Expected alert chains
Hardware faults typically produce a sequence of alerts:
| Root cause | Expected consequence | Within |
|---|---|---|
| GpuCriticalXid | SlurmNodeDrained + NodeCordoned | ~2 min |
| NodeUnreachable | LargeSlurmJobNodeUnhealthy | ~2 min |
| GPUUncorrectableEccErrors | SlurmNodeDrained | ~2 min |
| MachineCheckException | SlurmNodeDrained | ~2 min |
If a root-cause alert fires but its expected consequence does not appear within the listed window, escalate to Andromeda Support.