SLOs, error windows, and alerts are complicated. Here is an attempt to make them easy
APM for Node.js using Prometheus
Alert rules toolkit for Prometheus. Connect Prometheus, discover alert rules, apply!
Do more with your metrics
A curated list of AI-powered DevOps & SRE (Site Reliability Engineering) agents, tools, and resources for automating and enhancing reliability practices
GPU telemetry with workload attribution. One OTLP agent per node ties hardware metrics (NVIDIA, AMD, Intel Gaudi) to the K8s pod or Slurm job burning the GPU — so you know who's paying for that idle H100.
Tells you which code fired that query. Zero config.
Zero-code OpenTelemetry auto-instrumentation for Vert.x 4 + RxJava 3 and Vert.x 3 + RxJava 2. Distributed tracing, log-to-trace correlation, and RxJava context propagation.