Getting Started with Prometheus: An Open-Source Monitoring and Alerting Toolkit

About this article: alerting in Prometheus is split into two parts.
1) Alerting rules in Prometheus servers send alerts to Alertmanager.
2) Alertmanager then manages those alerts, including silencing, inhibition, and aggregation, and sends out notifications via email, on-call notification systems, and chat platforms.
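
To make the first half of that flow concrete, here is a minimal sketch of an alerting rule file. The group name, alert name, severity label, and 5-minute threshold are illustrative assumptions, not values from this article; `up` is the built-in metric Prometheus records for every scrape target.

```yaml
groups:
  - name: example-alerts            # hypothetical group name
    rules:
      - alert: InstanceDown         # hypothetical alert name
        expr: up == 0               # the target failed its most recent scrape
        for: 5m                     # must stay true for 5 minutes before the alert fires
        labels:
          severity: page            # assumed label, usable for routing in Alertmanager
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
```

Once the condition has held for the `for` duration, Prometheus sends the alert to whichever Alertmanager instances are listed in its configuration.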

Prometheus is an open-source systems monitoring and alerting toolkit.

It focuses on:
1) Reliable real-time monitoring
2) Collecting time series data
3) Providing a powerful query language (PromQL) for analyzing that data

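Features:
1) Monitoring the performance metrics of all kinds of resources, services, and applications
2) A multi-dimensional data model and a flexible query language, so users can easily get the information they care about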

In the Java ecosystem, Spring Boot provides the Actuator module for monitoring and managing applications.

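Example

1) Monitoring application health:
Actuator provides the /actuator/health endpoint for checking the health of an application. Through this endpoint you can find out whether the application is running normally, whether the database connection is healthy, and so on.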
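
As a rough sketch, a Spring Boot `application.yml` along these lines would expose the health endpoint over HTTP with component details. It assumes the `spring-boot-starter-actuator` dependency is on the classpath; exposing the `prometheus` endpoint additionally assumes the `micrometer-registry-prometheus` dependency, which this article does not mention.

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus   # endpoints served under /actuator (assumed selection)
  endpoint:
    health:
      show-details: always           # include per-component status, e.g. the database check
```

With this in place, /actuator/health returns a JSON document with an overall `status` field, and Prometheus could scrape /actuator/prometheus by setting `metrics_path` in a scrape job.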

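Alertmanager is a component developed by the Prometheus community.
It handles the alerts generated by the Prometheus monitoring system: it manages and routes alerts, sends notifications, and supports inhibiting and silencing alerts.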
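
The routing, grouping, and inhibition described above might look roughly like this in an Alertmanager configuration file. This is a sketch only: the receiver name, email addresses, SMTP host, and severity values are placeholders.

```yaml
global:
  smtp_smarthost: 'smtp.example.org:587'    # placeholder SMTP server
  smtp_from: 'alertmanager@example.org'     # placeholder sender address

route:
  receiver: team-email                      # default receiver (hypothetical name)
  group_by: ['alertname', 'job']            # aggregate alerts sharing these labels
  group_wait: 30s                           # wait before sending the first notification for a group
  group_interval: 5m                        # wait before notifying about new alerts in a group
  repeat_interval: 4h                       # wait before re-sending an unresolved notification

receivers:
  - name: team-email
    email_configs:
      - to: 'oncall@example.org'            # placeholder recipient

inhibit_rules:
  # Mute warning-level alerts while a critical alert with the same
  # alertname and instance is firing.
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['alertname', 'instance']
```

Silences are not part of this file; they are created at runtime, for example through the Alertmanager web UI.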

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'


# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'


    # Override the global default (the scrape_interval set under `global` above)
    # and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']
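
The configuration above only scrapes Prometheus itself, while the `node_cpu_seconds_total` metric used further down is exported by the Node Exporter. A possible additional scrape job is sketched here, assuming a Node Exporter runs locally on its default port 9100; the job name is illustrative.

```yaml
# Append this entry to the existing scrape_configs list.
scrape_configs:
  - job_name: 'node'                    # hypothetical job name
    static_configs:
      - targets: ['localhost:9100']     # assumed Node Exporter address
```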

prometheus_target_interval_length_seconds (the actual interval between scrapes of a target)

avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))

groups:
- name: cpu-node
  rules:
  - record: job_instance_mode:node_cpu_seconds:avg_rate5m
    expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))
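
For Prometheus to evaluate this recording rule, the rule file has to be referenced from the main configuration. A minimal sketch, assuming the rules above are saved as `prometheus.rules.yml` next to `prometheus.yml` (the file name is an assumption):

```yaml
rule_files:
  - 'prometheus.rules.yml'   # assumed file name; the path is relative to prometheus.yml
```

After a restart or configuration reload, the precomputed series can be queried under the name `job_instance_mode:node_cpu_seconds:avg_rate5m`.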