Meta-Monitoring Architecture

You could configure a Prometheus server to scrape and monitor its own metrics, but that is brittle: as soon as that Prometheus server experiences issues that prevent it from correctly collecting data, evaluating alerting rules, or delivering the resulting alerts, your meta-monitoring breaks along with it.

The entire point of meta-monitoring is to let you know when your monitoring is broken, so we recommend deploying a dedicated highly available Prometheus server pair (see also our High Availability for Monitoring and Alerting training) whose only job is to monitor your other production Prometheus servers and Alertmanager clusters. That way, the meta-monitoring pair is less likely to suffer from the same problems as your main monitoring infrastructure:

[Figure: Meta-monitoring architecture diagram]
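As a minimal sketch, the scrape configuration of the meta-monitoring pair could point at the self-exposed metrics endpoints of your production Prometheus and Alertmanager instances. The host names, ports, and job names below are placeholders for your own environment:

```yaml
# prometheus.yml for the meta-monitoring Prometheus pair.
# Targets are illustrative placeholders - substitute your own hosts.
global:
  scrape_interval: 15s

scrape_configs:
  # Scrape the /metrics endpoints of the production Prometheus servers.
  - job_name: 'prometheus'
    static_configs:
      - targets:
          - 'prometheus-1.example.com:9090'
          - 'prometheus-2.example.com:9090'

  # Scrape the members of the production Alertmanager cluster.
  - job_name: 'alertmanager'
    static_configs:
      - targets:
          - 'alertmanager-1.example.com:9093'
          - 'alertmanager-2.example.com:9093'
          - 'alertmanager-3.example.com:9093'
```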

You can then configure alerting rules based on the collected metrics to let you know when your main Prometheus or Alertmanager servers are down or experiencing other issues.
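For example, a simple rule file loaded into the meta-monitoring pair could alert on unreachable targets. The sketch below assumes the `prometheus` and `alertmanager` job names from the scrape configuration above; the `for` durations and severity labels are illustrative and should be tuned to your needs:

```yaml
# meta-monitoring.rules.yml - example alerting rules for the meta-monitoring pair.
groups:
  - name: meta-monitoring
    rules:
      # A production Prometheus server has been unreachable for 5 minutes.
      - alert: PrometheusDown
        expr: up{job="prometheus"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Prometheus server {{ $labels.instance }} is down or unreachable.'

      # A production Alertmanager instance has been unreachable for 5 minutes.
      - alert: AlertmanagerDown
        expr: up{job="alertmanager"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Alertmanager instance {{ $labels.instance }} is down or unreachable.'
```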

NOTE: Whether you also set up a dedicated Alertmanager cluster for meta-monitoring is up to you. As we will see in the section about watchdog alerts, there is still a last-resort way of knowing when your monitoring pipeline is completely broken, even if the Alertmanager is no longer working.