In a previous tutorial, Sebastian explained how you can monitor your Kubernetes cluster with the Prometheus Operator. This article builds on that and shows how to set up notifications by e-mail and as push notifications with the Alertmanager.
Installing the monitoring stack with Helm
In addition to the method Sebastian showed for deploying the Prometheus Operator, there is another way to set up a complete Prometheus stack in the Kubernetes cluster: the Helm package manager. The sources of the Helm chart for the Prometheus stack can be viewed on GitHub. There you will also find a file containing all configuration options and default values of the monitoring stack. Besides the Prometheus Operator and Grafana, the components include the Prometheus Node Exporter, Kube State Metrics and the Alertmanager. Kube State Metrics and the Alertmanager are new at this point. Kube State Metrics connects to the Kubernetes API and can therefore query metrics about all resources in the cluster. The Alertmanager can issue alerts for selected metrics based on a set of rules. When deploying via Helm, each component can be configured through the options in the values file. It makes sense to use a PVC for Prometheus and the Alertmanager so that the metrics as well as the history of alerts and the alert silences are persistent. In my case, a Persistent Volume Claim (PVC) is created for both with the following values. The name of the storage class can of course vary.
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: nws-storage
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: nws-storage
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi

I save the configuration in the “prom-config” directory as the “stack.values” file. Now you can deploy the stack. First you have to add the appropriate repo:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

The stack is then created:
helm install nws-prometheus-stack prometheus-community/kube-prometheus-stack -f prom-config/stack.values

Using port forwarding, you can call up the Grafana web interface in the browser and log in with the credentials admin/prom-operator.
kubectl port-forward service/nws-prometheus-stack-grafana 3002:80

Then open http://localhost:3002/ in the browser.
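If the admin password has been changed in the values, or you simply do not want to rely on the default, it can also be read from the secret that the Grafana chart creates. A quick sketch, assuming the chart’s default secret name, which is derived from the release name:

kubectl get secret nws-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d; echo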
Some predefined dashboards are already included. However, in order to display metrics on all dashboards (scheduler, etcd, etc.), additional options must be set in the Prometheus values. I will not go into how these are configured in this post. Instead, as mentioned at the beginning, I would like to demonstrate how to set up alerts. This is where the Alertmanager comes into play. However, you will look for it in vain in Grafana. A look at the Prometheus web UI makes more sense, so first set up a port forward for it again:
kubectl port-forward service/nws-prometheus-stack-kube-prometheus 3003:9090

Then call up http://localhost:3003/ in the browser.
Under Alerts, you can see all the alert rules that come preconfigured with the stack. By clicking on any rule, you can see its definition. However, nothing can be configured here either. The “Inactive”, “Pending” and “Firing” buttons can be used to show or hide the rules with the respective status.
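Under the hood, these rules are stored as PrometheusRule custom resources created by the Helm chart; we will edit one of them later in this post. If you want to take a look already, a quick read-only sketch using kubectl:

kubectl get prometheusrules
kubectl get prometheusrule nws-prometheus-stack-kube-k8s.rules -o yaml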
We move on to the next web interface, the Alertmanager. We also need a port forward for this:

kubectl port-forward service/nws-prometheus-stack-kube-alertmanager 3004:9093

The web interface (accessible at http://localhost:3004/) is similar to that of Prometheus. Under Alerts, however, you get a history of alerts that have already fired. Silences shows an overview of suppressed alerts, and Status contains version information, the operating status and the current config of the Alertmanager. Unfortunately, this config cannot be changed here either. So how do we configure the Alertmanager?
Configuring the Alertmanager
Some will have already guessed it: by adjusting the values of the Helm chart. So we write a new values file that contains only the Alertmanager config. The documentation shows which options are available and how they can be set. We start with a basic example and first add global SMTP settings: alertmanager-v1.values
alertmanager:
  config:
    global:
      resolve_timeout: 5m
      smtp_from: k8s-alertmanager@example.com
      smtp_smarthost: mail.example.com:587
      smtp_require_tls: true
      smtp_auth_username: k8s-alertmanager@example.com
      smtp_auth_password: xxxxxxxxx
    route:
      receiver: 'k8s-admin'
      routes: []
    receivers:
    - name: 'k8s-admin'
      email_configs:
      - to: k8s-admin@example.com

This allows the Alertmanager to send out e-mails. Of course, it is also necessary to define at least one recipient. Under “receivers” you can simply enter the name and e-mail address, as in my example. Please make sure that the indentation is correct, otherwise there may be problems starting the Alertmanager when the config is rolled out. To ensure that the alerts are actually delivered, the routes must be defined under “route”. If you first want to send all firing alerts to one contact, you can do it as in my example. We’ll take a closer look at what else you can do with routes in a later example. If you have amtool installed, you can also sanity-check this part of the configuration locally before rolling it out, as shown below.
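A minimal sketch of such a check, assuming the block under alertmanager.config is copied into a standalone file; the file name is just a placeholder:

amtool check-config prom-config/alertmanager-v1-check.yaml

amtool validates the global settings, the route tree and the receiver definitions before the config ever reaches the cluster.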
With a Helm upgrade, we can then deploy our new configuration for the Alertmanager. We adopt all the values already set using --reuse-values:

helm upgrade --reuse-values -f prom-config/alertmanager-v1.values nws-prometheus-stack prometheus-community/kube-prometheus-stack

And how do we test this now?
Test alerts
If you don’t want to shoot down a node straight away, you can simply restart the Alertmanager. The “Watchdog” alert rule fires at startup; it exists only to test that the alerting pipeline works properly. You can restart the Alertmanager like this:
kubectl rollout restart statefulset.apps/alertmanager-nws-prometheus-stack-kube-alertmanager

The e-mail should arrive shortly after the restart. If not, first check whether the Alertmanager is still starting. If it is stuck in a crash loop, the pod’s logs will tell you what is going wrong. If a config option is set incorrectly, you can adjust it in the values and deploy again with a Helm upgrade.
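A quick way to read those logs, assuming the Operator’s usual naming (first replica of the StatefulSet, container name alertmanager):

kubectl logs alertmanager-nws-prometheus-stack-kube-alertmanager-0 -c alertmanager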
If you have mistyped an option key, you can also roll back to a previous state with a Helm rollback:

helm rollback nws-prometheus-stack 1

This gives us a rudimentary monitoring system for the Kubernetes cluster itself. Of course, you can also add further receivers and thus notify several contacts.
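If you would rather not restart anything, you can also push a synthetic alert straight to the Alertmanager API through the existing port forward. A minimal sketch, assuming the port forward on localhost:3004 from earlier is still running; the alert name is just a placeholder:

curl -XPOST http://localhost:3004/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "ManualTestAlert", "severity": "error"}, "annotations": {"summary": "Manual test alert"}}]'

The alert then appears in the Alertmanager UI and should trigger a notification to the default receiver.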
Add your own metrics and alert rules
Now let’s take a look at what routes can be useful for. First, however, we will boot up a few pods for testing and create different namespaces. We will also take a quick look at the Prometheus Blackbox Exporter. The scenario: in a K8s cluster, different environments are operated via namespaces, e.g. development and production. If the sample app in the development namespace fails, the on-call team (“bereitschaft”) should not be alerted. The on-call team is only responsible for failures of the sample app in the production namespace and receives push notifications on the phone in addition to e-mails. The developers are informed separately by e-mail about problems with the sample app in the development namespace. We define the two namespaces nws-production and nws-development in namespaces-nws.yaml:
apiVersion: v1
kind: Namespace
metadata:
  name: nws-development
  labels:
    name: nws-development
---
apiVersion: v1
kind: Namespace
metadata:
  name: nws-production
  labels:
    name: nws-production

kubectl apply -f ./prom-config/namespaces-nws.yaml

Now we start two sample apps that alternately return HTTP 200 and HTTP 500 in a 60-second interval. I am using a simple image that I created for this purpose (sources on GitHub). sample-app.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nws-sample-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nws-sample-app
  template:
    metadata:
      labels:
        app: nws-sample-app
    spec:
      containers:
      - name: nws-sample-app
        image: gagaha/alternating-http-response
        ports:
        - containerPort: 80

kubectl apply -f prom-config/sample-app.yaml -n nws-production
kubectl apply -f prom-config/sample-app.yaml -n nws-development

We then expose the app in the cluster:
kubectl expose deployment nws-sample-app -n nws-production
kubectl expose deployment nws-sample-app -n nws-development

If you want to verify that the services respond inside the cluster, see the quick test below.
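A minimal sketch of such a check, assuming the curlimages/curl image may be pulled into the cluster; the pod name is just a placeholder and the pod is removed again after the request:

kubectl run http-check --rm -it --restart=Never --image=curlimages/curl --command -- curl -s -o /dev/null -w '%{http_code}\n' http://nws-sample-app.nws-production.svc

Depending on where the app currently is in its 60-second cycle, this prints 200 or 500.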
However, we now need a component that can query the availability of these apps via HTTP requests and expose the results as metrics for Prometheus. The Prometheus Blackbox Exporter is ideal for this. In addition to HTTP/HTTPS requests, it can also check connections via the DNS, TCP and ICMP protocols. First we have to deploy the Blackbox Exporter in the cluster. Again, I will use the official Helm chart.

helm install nws-blackbox-exporter prometheus-community/prometheus-blackbox-exporter

Now we have to tell Prometheus how to reach the Blackbox Exporter. The targets are the interesting part: here we enter the HTTP endpoints that the Blackbox Exporter should query. We write the additional configuration in a file and deploy it via Helm upgrade: prom-blackbox-scrape.yaml
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: 'nws-blackbox-exporter'
      metrics_path: /probe
      params:
        module: [http_2xx]
      static_configs:
      - targets:
        - http://nws-sample-app.nws-production.svc
        - http://nws-sample-app.nws-development.svc
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: nws-blackbox-exporter-prometheus-blackbox-exporter:9115

helm upgrade --reuse-values -f prom-config/prom-blackbox-scrape.yaml nws-prometheus-stack prometheus-community/kube-prometheus-stack

If we then start the port forward for Prometheus again, we can see the two new targets of the nws-blackbox-exporter at http://localhost:3003/targets. Metrics for Prometheus are now available. If you want, you can also query the exporter by hand, as shown below.
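A minimal sketch of a manual probe, assuming the exporter’s default port 9115 and the service name used in the relabel config above:

kubectl port-forward service/nws-blackbox-exporter-prometheus-blackbox-exporter 9115:9115
curl "http://localhost:9115/probe?module=http_2xx&target=http://nws-sample-app.nws-production.svc"

The response contains, among other things, probe_success and probe_http_status_code, exactly the metrics that Prometheus scrapes and that we will alert on next.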
However, we also need to define new alert rules so that alerts are sent for these metrics. We edit the rules directly via kubectl:

kubectl edit prometheusrules nws-prometheus-stack-kube-k8s.rules

We add our new rule before the k8s.rules:
...
spec:
  groups:
  - name: blackbox-exporter
    rules:
    - alert: HttpStatusCode
      annotations:
        description: |-
          HTTP status code is not 200-399
          VALUE = {{ $value }}
          LABELS: {{ $labels }}
        summary: HTTP Status Code (instance {{ $labels.instance }})
      expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
      for: 30s
      labels:
        severity: error
  - name: k8s.rules
    rules:
    ...

Now we just need to define the contact details of the various recipients and the routes. Under route, you can specify different receivers in routes; these receivers must of course also exist further down. Conditions can also be defined for a route, and the notification is only sent to the specified receiver if the conditions match. Here is the config for the scenario: alertmanager-v2.values
alertmanager:
  config:
    global:
      resolve_timeout: 5m
      smtp_from: k8s-alertmanager@example.com
      smtp_smarthost: mail.example.com:587
      smtp_require_tls: true
      smtp_auth_username: k8s-alertmanager@example.com
      smtp_auth_password: xxxxxxxxx
    route:
      receiver: 'k8s-admin'
      repeat_interval: 5m
      routes:
      - receiver: 'dev_mail'
        match:
          instance: http://nws-sample-app.nws-development.svc
      - receiver: 'bereitschaft'
        match:
          instance: http://nws-sample-app.nws-production.svc
    receivers:
    - name: 'k8s-admin'
      email_configs:
      - to: k8s-admin@example.com
    - name: 'dev_mail'
      email_configs:
      - to: devs@example.com
    - name: 'bereitschaft'
      email_configs:
      - to: bereitschaft@example.com
      pushover_configs:
      - user_key: xxxxxxxxxxxxxxxxxxxxxxxxxxx
        token: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

helm upgrade --reuse-values -f prom-config/alertmanager-v2.values nws-prometheus-stack prometheus-community/kube-prometheus-stack

It is best to restart the Alertmanager afterwards:
kubectl rollout restart statefulset.apps/alertmanager-nws-prometheus-stack-kube-alertmanager

If the configuration is correct, alerts should arrive soon.
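If notifications end up with the wrong recipient, it can help to check the routing tree offline with amtool. A minimal sketch, again assuming the block under alertmanager.config is copied into a standalone file (the file name is a placeholder):

amtool config routes test --config.file=prom-config/alertmanager-v2-check.yaml instance=http://nws-sample-app.nws-production.svc

For the scenario above, this should print bereitschaft as the matching receiver.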
Conclusion
Setting up the Alertmanager can be time-consuming. In return, you get many configuration options and can set up your notifications exactly as you need them using rules. If you want, you can also edit the message templates and thus customize the format and the information they contain.




