End-to-End Watchdog Alerts

Even if you build the most reliable internal meta-monitoring infrastructure, something can still go wrong: the meta-monitoring setup might have a systematic error that causes it to break, the Alertmanager through which notifications are supposed to get delivered to you might be down, or your entire organization's network might be disconnected from the internet!

To help you catch these kinds of situations as a last resort, it's a good idea to set up a watchdog alert (also sometimes called a "dead man's switch", sentinel alert, beacon alert, or heartbeat alert) that helps you continuously test your entire alerting pipeline from beginning to end.

You can achieve this by:

  • Configuring an alerting rule that is always firing, e.g. using the PromQL expression vector(1), or simply 1 (the numeric value doesn't matter) to always produce a single output time series.
  • Configuring Alertmanager to route the resulting alert notification to an external service with a repeat interval of only a few (x) minutes.
  • Configuring the external service to expect an incoming alert notification every x minutes (plus some slack time) or to send you an alert notification otherwise.

When anything in your alerting pipeline breaks, the external service will then notice that no more alert notifications are coming in and will send you a notification about it.
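
As a rough sketch of the Alertmanager side of this setup, the routing configuration could look something like the following. The receiver names, the 5-minute repeat interval, and the webhook URL are illustrative assumptions and not taken from any particular external service; substitute the endpoint and interval that your heartbeat service expects.

    route:
      receiver: default
      routes:
        # Send only the always-firing watchdog alert to the external service.
        - matchers:
            - alertname="Watchdog"
          receiver: watchdog-heartbeat   # hypothetical receiver name
          group_wait: 0s
          group_interval: 1m
          # Assumed value for "x": re-send the notification every 5 minutes.
          repeat_interval: 5m

    receivers:
      - name: default
      - name: watchdog-heartbeat
        webhook_configs:
          # Placeholder URL; use the check-in endpoint of your external service.
          - url: https://heartbeat.example.com/ping/watchdog
            # Resolved notifications carry no meaning for an always-firing alert.
            send_resolved: false

With a configuration along these lines, the external service would be set to expect a check-in every 5 minutes plus some slack time, and to notify you if none arrives.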

Here is a real-world example of a watchdog alert, as it is included in the kube-prometheus project's example meta-monitoring rules:

    - alert: Watchdog
      annotations:
        description: |
          This is an alert meant to ensure that the entire alerting pipeline is functional.
          This alert is always firing, therefore it should always be firing in Alertmanager
          and always fire against a receiver. There are integrations with various notification
          mechanisms that send a notification when this alert is not firing. For example the
          "DeadMansSnitch" integration in PagerDuty.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/watchdog
        summary: An alert that should always be firing to certify that Alertmanager
          is working properly.
      expr: vector(1)
      labels:
        severity: none

To handle watchdog alerts like this with an on-call notification service like PagerDuty, see PagerDuty's Dead Man's Snitch integration guide.