Grafana access

Sign in at grafana.andromedacluster.xyz with Andromeda SSO credentials. The session lands in the Tenants org with the Viewer role. Three dashboards are available, pre-filtered to your assigned nodes and namespaces. You only see metrics and workloads for your organization’s reserved capacity.
  • GPU Nodes - GPU utilization, temperature, power, ECC, memory, node CPU/memory
  • Job Analysis - Slurm job state, GPU/CPU allocation, node mapping
  • Tenant Dashboard - capacity overview, node readiness, reservation status
Figure: Grafana reservation summary showing reserved nodes, reserved GPUs, and reserved clusters.
Figure: Grafana node info panels showing hostname, cluster, OS type, GPU inventory, uptime, and active alerts.
For panel-level detail, see Dashboards. For custom dashboards, additional metrics, or a direct feed to an external monitoring stack, contact Andromeda Support.

Metric naming

Some node-level metrics use a tenant_ prefix to scope them to your assigned capacity:
Prefix                   Coverage
tenant_node_*            CPU, memory, load, pressure, thermals, EDAC
tenant_kube_node_*       Kubernetes node conditions, cordon state
tenant_weka_*            Weka storage where deployed
tenant_ib_perfquery_*    InfiniBand fabric counters
GPU metrics (DCGM_FI_*), container metrics, and Slurm metrics use their standard names and are already scoped by namespace and node assignment. Full list in Metrics Reference.
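To make the prefix scheme concrete, here is a minimal Python sketch that looks up which coverage group a metric name falls into. The prefixes and coverage text are copied from the table above. The function name `coverage_for` and the example metric `tenant_node_load1` are illustrative assumptions, not names confirmed by the Metrics Reference.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Prefix patterns and coverage descriptions copied from the table above.
TENANT_PREFIXES = {
    "tenant_node_*": "CPU, memory, load, pressure, thermals, EDAC",
    "tenant_kube_node_*": "Kubernetes node conditions, cordon state",
    "tenant_weka_*": "Weka storage where deployed",
    "tenant_ib_perfquery_*": "InfiniBand fabric counters",
}


def coverage_for(metric_name: str) -> Optional[str]:
    """Return the coverage group for a tenant_-prefixed metric, or None.

    None means the name uses none of the tenant_ prefixes, as is the case
    for DCGM_FI_*, container, and Slurm metrics with their standard names.
    """
    for pattern, coverage in TENANT_PREFIXES.items():
        if fnmatchcase(metric_name, pattern):
            return coverage
    return None


# tenant_node_load1 is a hypothetical metric name used only for illustration.
print(coverage_for("tenant_node_load1"))
print(coverage_for("DCGM_FI_DEV_GPU_UTIL"))
```

A lookup like this can help when auditing alert rules or exported dashboards for metric names that fall outside the documented groups.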
Figure: Grafana Explore view with PromQL queries and time-series output.

Permissions

Pre-built dashboards are read-only for all users. Your team can ask Andromeda Support to enable ad-hoc Explore queries or dashboard editing when you need those capabilities.
Action                                Allowed
View pre-built dashboards             Yes
Use Explore, ad-hoc queries           No
Create, edit, or delete dashboards    No, planned
Query raw node_* or kube_node_*       No, use tenant_node_* / tenant_kube_node_*
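If you are porting queries written against raw node_* or kube_node_* metrics, the substitution in the last row can be sketched as a simple text rewrite. This minimal Python example assumes the tenant-scoped name is exactly the raw name with tenant_ prepended. The table implies this, but check individual names against the Metrics Reference. The helper name `to_tenant_metric` is hypothetical.

```python
import re

# Match a raw node_* or kube_node_* name that starts a new identifier.
# The lookbehind skips names that are already prefixed (e.g. tenant_node_*).
_RAW_METRIC = re.compile(r"(?<![A-Za-z0-9_:])(kube_node_|node_)(?=[A-Za-z0-9_])")


def to_tenant_metric(expr: str) -> str:
    """Prepend tenant_ to raw node_* / kube_node_* names in a PromQL string.

    This is a naive text substitution, not a PromQL parser: a label whose
    name starts with node_ would also be rewritten, so review the result.
    """
    return _RAW_METRIC.sub(lambda m: "tenant_" + m.group(1), expr)


print(to_tenant_metric('rate(node_cpu_seconds_total{mode="idle"}[5m])'))
print(to_tenant_metric("kube_node_status_condition"))
print(to_tenant_metric("tenant_node_load1"))  # already scoped, left unchanged
```

Names that already carry the tenant_ prefix pass through unchanged, so running the rewrite twice on the same query does no harm.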