

Andromeda monitors clusters continuously and routes alerts to your team’s designated Slack channel. This page lists alert rules, meanings, routing behavior, and expected follow-up.
Figure: Grafana memory dashboard with panels for memory usage, OOM kills, huge pages, EDAC memory errors, swap usage, and memory pressure.

On this page:

- Alert tiers: Severity levels and examples.
- GPU alerts: Thermal, ECC, NVLink, PCIe, row remap, and XID signals.
- Node alerts: Readiness, reachability, cordon, drain, and pressure states.
- Host alerts: CPU, memory, I/O, network, disk, and temperature pressure.
- Telemetry health: Metrics and logs freshness checks.
- Expected chains: Follow-on signals that should appear after hardware faults.

Alert tiers

| Tier | Severity | Meaning | Example |
|------|----------|---------|---------|
| t0 | critical | Cluster-wide emergency. Large-scale node loss or capacity collapse. | ClusterMassLoss, ClusterNodeCapacityCritical |
| t1 | critical | High-impact job or capacity disruption. Immediate investigation. | TenantNodeCapacityCritical, LargeSlurmJobNodeUnhealthy, GpuCriticalXid |
| t2 | warning | Degradation trending. Not an emergency but needs attention. | TenantNodeCapacityDegraded, GPUCorrectableEccErrorsHigh |
| t3 | info | Individual node state changes. Informational. | NodeCordoned, SlurmNodeDrained |

Alert types

| Type | Behavior |
|------|----------|
| edge | Fires once on state transition. No repeat. Used for hardware faults. |
| state | Fires while the condition persists. Repeats at long intervals (24h for t0/t1). |
| sustained | Fires after a condition persists beyond a `for` duration, such as 10m or 15m. |
| aggregate | Fleet-level threshold, such as >=50% of nodes lost. |
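
The `sustained` type maps naturally onto a Prometheus-style `for:` clause, while `edge` rules fire as soon as their expression becomes true. The sketch below is illustrative only, assuming Prometheus-style rule definitions; it is not Andromeda's actual rule file, and edge deduplication and Slack routing are handled by Andromeda's pipeline rather than by anything shown here.

```yaml
# Illustrative sketch only (not Andromeda's actual rules). Shows how a
# "sustained" rule differs from an "edge" rule in Prometheus-style syntax.
groups:
  - name: example-alert-types
    rules:
      # edge: fires as soon as the expression is true; no "for" delay.
      # (Deduplication of repeat firings is handled outside this rule.)
      - alert: GPUNvlinkErrors
        expr: increase(DCGM_FI_DEV_GPU_NVLINK_ERRORS[15m]) > 0
        labels:
          tier: t2
          severity: warning
      # sustained: the condition must hold for the full "for" duration
      # (here 10m) before the alert fires.
      - alert: HostIOPressureHigh
        expr: rate(node_pressure_io_stalled_seconds_total[1m]) > 0.3
        for: 10m
        labels:
          tier: t2
          severity: warning
```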

Slack behavior

Alerts fire once per state change. Slack does not post a separate “resolved” message for every alert; use the Alerts Overview dashboard in Grafana for current status. When more than 5 events of the same type fire within 60 seconds, they are aggregated into a single Slack message.
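
Routing and grouping are managed for you, so there is nothing to configure. For intuition only, the snippet below shows what grouping similar alerts into a single Slack message looks like in a generic Alertmanager configuration; it is not Andromeda's actual configuration, and the five-event aggregation described above is applied by Andromeda's own pipeline.

```yaml
# For intuition only: generic Alertmanager-style grouping into one Slack
# message. Not Andromeda's configuration; the channel name is a placeholder.
route:
  receiver: team-slack
  group_by: ['alertname', 'cluster']   # batch events of the same type and cluster
  group_wait: 60s                      # collect similar events for up to 60s
receivers:
  - name: team-slack
    slack_configs:
      - channel: '#team-alerts'
        send_resolved: false           # no separate "resolved" message
```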

Cluster and capacity alerts

| Alert | Tier | Type | Condition | Meaning |
|-------|------|------|-----------|---------|
| ClusterMassLoss | t0 | aggregate | >=50% of GPU nodes gone AND >=5 missing, 2m | Catastrophic cluster failure. |
| ClusterNodeCapacityCritical | t0 | aggregate | >=10% of nodes bad AND >=5, 2m | Significant cluster degradation. |
| ClusterNodeCapacityDegraded | t2 | sustained | >=5% bad AND >=3, 15m | Cluster health trending down. |
| TenantNodeCapacityCritical | t1 | state | 0 ready nodes with >0 assigned, 5m | All assigned nodes unavailable. |
| TenantNodeCapacityDegraded | t2 | sustained | >=10% bad AND >=2, 10m | Part of your assigned capacity is degraded. |
| SlurmCapacityMismatch | t3 | sustained | Assigned != accounted, 15m | Slurm and Kubernetes disagree on node count. Review in Grafana (informational; not posted to Slack). |
| NodeCountDrift | t3 | sustained | Below 7-day peak, 1h | Node count dropped. Review in Grafana (informational; not posted to Slack). |
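
As a rough illustration of how an aggregate, sustained capacity rule can be expressed, the sketch below combines a percentage threshold with an absolute count, assuming kube-state-metrics series and approximating "bad" as NotReady; the actual ClusterNodeCapacityDegraded rule and its definition of a bad node may differ.

```yaml
# Hypothetical sketch in the shape of ClusterNodeCapacityDegraded
# (>=5% bad AND >=3 bad, sustained for 15m). Not the actual rule.
groups:
  - name: example-capacity-rules
    rules:
      - alert: ClusterNodeCapacityDegraded
        expr: |
          (
            count(kube_node_status_condition{condition="Ready", status="true"} == 0)
              / count(kube_node_info)
          ) >= 0.05
          and
          count(kube_node_status_condition{condition="Ready", status="true"} == 0) >= 3
        for: 15m
        labels:
          tier: t2
          severity: warning
```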

GPU alerts

| Alert | Tier | Type | Condition | Meaning |
|-------|------|------|-----------|---------|
| GPUThermalThrottle | t2 | edge, 1h dedup | GPU temp > 83 °C | GPU is thermal throttling. Performance degraded. |
| GPUMemoryThermal | t2 | edge, 1h dedup | HBM temp > 95 °C | HBM overheating. Severe performance impact. |
| GPUNvlinkErrors | t2 | edge | `increase(DCGM_FI_DEV_GPU_NVLINK_ERRORS[15m]) > 0` | NVLink degradation. Affects multi-GPU jobs. |
| GPUUncorrectableEccErrors | t1 | edge | `increase(DCGM_FI_DEV_ECC_DBE_*[15m]) > 0` | Uncorrectable memory error. GPU will likely be drained. |
| GPUCorrectableEccErrorsHigh | t2 | edge | `increase(DCGM_FI_DEV_ECC_SBE_*[1h]) > 10` | High rate of correctable errors. GPU degrading. |
| GPURowRemapFailure | t1 | edge | `DCGM_FI_DEV_ROW_REMAP_FAILURE > 0` | Row remapping failed. GPU needs replacement. |
| GPUCorrectableRemappedRows | t3 | edge | Row remapped due to SBE | Correctable error triggered a row remap. Informational. |
| GPUUncorrectableRemappedRows | t2 | edge | Row remapped due to DBE | Uncorrectable error triggered a row remap. |
| GPUPcieReplayErrorsHigh | t2 | edge | `increase(DCGM_FI_DEV_PCIE_REPLAY_COUNTER[15m]) > 50` | PCIe link instability. |
| GPUAllocatableDegraded | t2 | state | GPU count dropped below 8 per node | Node has fewer GPUs than expected. |
| GpuCriticalXid | t1 | edge | Critical XID codes from host diagnostics | Hardware GPU fault. |
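
If you want to look at the raw signals behind a firing GPU alert, you can query the same DCGM series ad hoc in Grafana Explore. A few illustrative queries follow; the `instance` label and the specific ECC counter variant are assumptions, so adjust them to the labels your dashboards use.

```promql
# Ad-hoc checks against the same DCGM series the GPU alerts use.
# "node-123" is a placeholder; the SBE counter shown is one common variant
# of the DCGM_FI_DEV_ECC_SBE_* family.
DCGM_FI_DEV_GPU_TEMP{instance="node-123"}                          # current GPU temperature (°C)
increase(DCGM_FI_DEV_ECC_SBE_VOL_TOTAL{instance="node-123"}[1h])   # correctable ECC errors, last hour
DCGM_FI_DEV_ROW_REMAP_FAILURE{instance="node-123"}                 # non-zero means row remapping failed
```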

Node alerts

| Alert | Tier | Type | Condition | Meaning |
|-------|------|------|-----------|---------|
| NodeNotReady | t2 | state | `kube_node_status_condition{Ready}==0` | Kubernetes considers the node unhealthy. |
| NodeUnreachable | t1 | edge, 30m dedup | Node unreachable | Node is completely unreachable. |
| NodeCordoned | t3 | state, 2m | `kube_node_spec_unschedulable==1` | Node marked unschedulable. |
| SlurmNodeDrained | t3 | state | Slurm drain state set | Node pulled from Slurm scheduling. |
| SlurmNodeNotResponding | t2 | state | Slurm node not responding | Slurm cannot reach the node. |
| NodeNetworkUnavailable | t2 | state | NetworkUnavailable condition | Node networking is down. |
| NodePIDPressure | t2 | state | PIDPressure condition | Running out of process IDs. |
| NodeDiskPressure | t2 | state | DiskPressure condition | Disk space critically low. |
| NodeMemoryPressure | t2 | state | MemoryPressure condition | System memory exhausted. |
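
Several of the node alerts above map directly onto standard Kubernetes node conditions, so you can check a node's current state with the same kube-state-metrics series the alerts use. For example (`node-123` is a placeholder):

```promql
# Current node state via kube-state-metrics; "node-123" is a placeholder.
kube_node_status_condition{node="node-123", condition="Ready", status="true"}            # 1 = Ready
kube_node_spec_unschedulable{node="node-123"}                                             # 1 = cordoned
kube_node_status_condition{node="node-123", condition="MemoryPressure", status="true"}   # 1 = memory pressure
```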

Host alerts

| Alert | Tier | Type | Condition | Meaning |
|-------|------|------|-----------|---------|
| HostMemoryPressureHigh | t2 | sustained, 10m | PSI memory waiting >0.5 OR stalled >0.2 | Memory thrashing. Jobs will slow down. |
| HostCPUPressureHigh | t2 | sustained, 10m | PSI CPU waiting >0.5 | Severe CPU contention. Data loading bottleneck likely. |
| HostIOPressureHigh | t2 | sustained, 10m | PSI I/O stalled >0.3 | Storage I/O blocking processes. |
| HostNetworkErrorsHigh | t2 | edge, 6h dedup | Network errors >10/s | Persistent network errors. |
| HostNodeOvertemperatureAlarm | t1 | edge | Hardware thermal sensor alarm | Physical hardware overheating. |
| NodeDiskFillingUp | t2 | sustained | Predicted to fill within window | Disk will fill at current rate. |
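
The PSI thresholds above are stall fractions: the share of wall-clock time during which tasks were stalled on a resource. Assuming the values come from node_exporter's pressure collector, you can inspect them directly, for example:

```promql
# PSI stall fractions over the last minute (node_exporter pressure metrics).
# Values near 1.0 mean tasks are stalled almost all the time; "node-123" is a placeholder.
rate(node_pressure_memory_waiting_seconds_total{instance="node-123"}[1m])
rate(node_pressure_memory_stalled_seconds_total{instance="node-123"}[1m])
rate(node_pressure_cpu_waiting_seconds_total{instance="node-123"}[1m])
rate(node_pressure_io_stalled_seconds_total{instance="node-123"}[1m])
```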

Log-derived alerts

These alerts evaluate host log signals in addition to metrics:

| Alert | Tier | Type | Catches |
|-------|------|------|---------|
| GpuCriticalXid | t1 | edge | 46 known critical GPU XID error codes |
| NvlinkError | t1 | edge | NVLink errors in host diagnostics |
| Mlx5DriverError | t1 | edge | Mellanox NIC driver errors |
| LinkDown | t1 | edge | Network link down events |
| MachineCheckException | t1 | edge | CPU machine check exceptions (hardware fault) |

Telemetry health

| Alert | Tier | Condition | Meaning |
|-------|------|-----------|---------|
| ClusterMetricsStale | t2 | No metrics from the cluster for 5+ min | Metrics are stale for this cluster. Refresh Grafana; contact support if it persists. |
| ClusterLogsStale | t2 | No logs for 15+ min | Host logs are stale for this cluster. Retry log workflows; contact support if it persists. |

Alert routing

When an alert fires, it is posted to your designated Slack channel. Related signals on the same node may be grouped so your channel stays readable.

Expected alert chains

Hardware faults typically produce a sequence of alerts:

| Root cause | Expected consequence | Within |
|------------|----------------------|--------|
| GpuCriticalXid | SlurmNodeDrained + NodeCordoned | ~2 min |
| NodeUnreachable | LargeSlurmJobNodeUnhealthy | ~2 min |
| GPUUncorrectableEccErrors | SlurmNodeDrained | ~2 min |
| MachineCheckException | SlurmNodeDrained | ~2 min |
If a root cause fires but the expected consequence does not appear within the expected window, escalate to Andromeda Support.
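
Before escalating, you can confirm whether the expected follow-on alert is actually firing by querying Prometheus's built-in ALERTS series in Grafana Explore. For example (the `node` label name is an assumption and `node-123` is a placeholder):

```promql
# Is the expected consequence currently firing for this node?
ALERTS{alertname="SlurmNodeDrained", alertstate="firing", node="node-123"}
```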