Integrated Alerting

Prometheus integrates collecting and processing of time series data with an active alerting system. Its philosophy is to collect as much data about your systems as possible in a single data model so that you can then formulate integrated queries over it. The same query language that is used for ad-hoc queries and dashboarding is also used to define alerting rules. This is in contrast to a historical split between fault-detection systems like Nagios, which run periodic check scripts and keep little historical data, and standalone time series databases that store metrics.

For example, the following alerting rule (loaded into Prometheus as part of a rule configuration file) would alert you if the number of HTTP requests that resulted in a 500 status code exceeded 5% of the total traffic for a given path:

alert: Many500Errors
# This is the PromQL expression that forms the "heart" of the alerting rule.
expr: |
  (
      sum by(path) (rate(http_requests_total{status="500"}[5m]))
    /
      sum by(path) (rate(http_requests_total[5m]))
  ) * 100 > 5
for: 5m
labels:
  severity: "critical"
annotations:
  summary: "Many 500 errors for path {{$labels.path}} ({{$value}}%)"

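The PromQL expression in the expr field forms the core of an alerting rule, while additional YAML-based configuration options control how long the condition must hold before the alert fires (the for field), alert metadata, routing labels, and more. Because the rule is evaluated over the same time series that serve ad-hoc queries and dashboards, an alert condition can be as specific as any query you can write over the collected data.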
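
As a minimal sketch of how such a rule might sit inside a complete rule file: Prometheus rule files wrap rules in named groups, so the rule above would appear under a groups entry. The file name rules.yml and the group name http-errors are assumptions chosen for illustration; they do not come from the text above.

```yaml
# rules.yml (hypothetical file name)
groups:
  - name: http-errors   # hypothetical group name
    rules:
      - alert: Many500Errors
        # Percentage of requests per path that returned status 500,
        # compared against a 5% threshold.
        expr: |
          (
              sum by(path) (rate(http_requests_total{status="500"}[5m]))
            /
              sum by(path) (rate(http_requests_total[5m]))
          ) * 100 > 5
        # The condition must hold for 5 minutes before the alert fires.
        for: 5m
        labels:
          severity: "critical"
        annotations:
          summary: "Many 500 errors for path {{$labels.path}} ({{$value}}%)"
```

A file like this is typically referenced from the rule_files list in the main Prometheus configuration, and it can be syntax-checked before loading with promtool check rules rules.yml.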