End-to-End Watchdog Alerts
Even if you build the most reliable internal meta-monitoring infrastructure, something can still go wrong: the meta-monitoring setup might have a systematic error that causes it to break, the Alertmanager through which notifications are supposed to get delivered to you might be down, or your entire organization's network might be disconnected from the internet!
To help you catch these kinds of situations as a last resort, it's a good idea to set up a watchdog alert (also sometimes called a "dead man's switch", sentinel alert, beacon alert, or heartbeat alert) that helps you continuously test your entire alerting pipeline from beginning to end.
You can achieve this by:
- Configuring an alerting rule that is always firing, e.g. using the PromQL expression `vector(1)`, or simply `1` (the numeric value doesn't matter), to always produce a single output time series.
- Configuring Alertmanager to route the resulting alert notification to an external service with a repeat interval of only a few minutes.
- Configuring the external service to expect an incoming alert notification every `x` minutes (plus some slack time) and to send you an alert notification otherwise.
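As a sketch, the Alertmanager routing for such a setup might look like the following (the receiver name, webhook URL, and one-minute repeat interval are illustrative assumptions, not values prescribed by this article):

```yaml
route:
  routes:
    # Route the always-firing watchdog alert to the external heartbeat
    # service and re-send the notification every minute, so the external
    # service notices quickly when the pipeline breaks.
    - matchers:
        - alertname = Watchdog
      receiver: external-watchdog-service
      repeat_interval: 1m

receivers:
  # Hypothetical webhook receiver pointing at the external service.
  - name: external-watchdog-service
    webhook_configs:
      - url: https://watchdog.example.com/heartbeat
```

The short `repeat_interval` matters here: it bounds how long the external service has to wait before it can conclude that the pipeline is broken.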
When anything in your alerting pipeline breaks, the external service will then notice that no more alert notifications are coming in and will send you a notification about it.
Here is a real-world example of a watchdog alert, as it is included in the
kube-prometheus project's example meta-monitoring rules:
```yaml
- alert: Watchdog
  annotations:
    description: |
      This is an alert meant to ensure that the entire alerting pipeline is functional.
      This alert is always firing, therefore it should always be firing in Alertmanager
      and always fire against a receiver. There are integrations with various notification
      mechanisms that send a notification when this alert is not firing. For example the
      "DeadMansSnitch" integration in PagerDuty.
    runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/watchdog
    summary: An alert that should always be firing to certify that Alertmanager is working properly.
  expr: vector(1)
  labels:
    severity: none
```
To handle watchdog alerts like this with an on-call notification service like PagerDuty, see PagerDuty's Dead Man's Snitch integration guide.