Why monitoring?

You have just finished the firmware for your microcontroller and have a number of devices installed at your customer’s site. They seem to work, but how do they behave in the long run? Do you want to know whether they are online, how much memory they use and how many times they have restarted?

Well, you could implement a custom solution, but why reinvent the wheel? There are already many tools out there that can help you with this task.

In this tutorial, I will show you how to monitor your microcontroller devices with Prometheus and Grafana. I will use the ESP32 as an example. The approach shown is best suited when you have a bunch of devices you want to monitor and the monitoring is done on a server in the same network, for example in an industrial plant.

What is Prometheus?

Prometheus is an open-source monitoring system that was originally built by SoundCloud. Its main purpose is to collect metrics from servers and applications and store them in a time-series database. It also supports alerting and provides a web interface where you can query the collected metrics, but that interface is not very user-friendly. That’s why we will use Grafana to visualize the data.

So, how does Prometheus work? It uses a pull-based model: at a configurable interval, it scrapes (requests) the metrics from each configured target. A target can be an application that exposes metrics, a server, or your microcontroller device.

The Prometheus project provides node-exporter, which collects metrics from the host system, such as CPU usage, memory, disk space and network traffic. The node-exporter is made for typical Linux servers. Since our device is microcontroller based and has very limited resources, it cannot run node-exporter, so we will implement our own solution.

How to expose the metrics?

Prometheus expects the metrics in the Prometheus exposition format, served over an HTTP endpoint. By default Prometheus scrapes the path /metrics; port 9100 is a common choice (it is the default port of node-exporter), but you can use other paths and ports. We’ll see later how to configure Prometheus. This means our microcontroller device needs to run an HTTP server that exposes the metrics in the correct format.
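To make "an HTTP endpoint that exposes the metrics" concrete, here is a minimal sketch of what a scrape response looks like on the wire. It assumes a plain HTTP/1.1 response and the text format's usual content type (`text/plain; version=0.0.4`); the function name `build_metrics_response` is hypothetical and not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Content type commonly used for the Prometheus text exposition format. */
#define METRICS_CONTENT_TYPE "text/plain; version=0.0.4; charset=utf-8"

/*
 * Hypothetical helper: writes a complete HTTP/1.1 200 response carrying
 * `body` (the metrics text) into `out`.
 * Returns the number of bytes written, or -1 if `out` is too small.
 */
int build_metrics_response(char *out, size_t out_size, const char *body)
{
    assert(out != NULL && body != NULL);
    size_t body_len = strlen(body);
    int n = snprintf(out, out_size,
                     "HTTP/1.1 200 OK\r\n"
                     "Content-Type: " METRICS_CONTENT_TYPE "\r\n"
                     "Content-Length: %zu\r\n"
                     "Connection: close\r\n"
                     "\r\n"
                     "%s",
                     body_len, body);
    if (n < 0 || (size_t)n >= out_size) {
        return -1; /* encoding error or response truncated */
    }
    return n;
}
```

On an ESP32 running ESP-IDF you would normally let esp_http_server build the status line and headers for you: set the content type with httpd_resp_set_type() and send only the body with httpd_resp_send().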

Metrics

The most important types of metrics are counters and gauges.

  • Counters are used to count the number of events that occur. They can only increase or be reset to zero. Examples of counters are the number of requests to a web server or the number of times a device has restarted.
  • Gauges are used to measure the current state of something and can go up and down. Examples of gauges are the current temperature or the amount of free memory.

Here is an example of the data that Prometheus expects:

# TYPE free_heap_bytes gauge
free_heap_bytes{} 122884
# TYPE restarts_total counter
restarts_total{} 43
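If you generate this text yourself, each metric needs a "# TYPE" line followed by its sample. The sketch below, with the hypothetical function `append_metric`, appends one unlabeled metric to a buffer. It omits the empty {} (the format allows leaving out the label set when there are no labels) and prints values with %.17g, which keeps enough digits to represent a double exactly.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: appends one metric without labels, e.g.
 *   # TYPE free_heap_bytes gauge
 *   free_heap_bytes 122884
 * to `buf`, which already holds `used` bytes of text.
 * Returns the new total length, or -1 if the buffer is too small.
 */
int append_metric(char *buf, size_t size, size_t used,
                  const char *name, const char *type, double value)
{
    assert(used < size);
    int n = snprintf(buf + used, size - used,
                     "# TYPE %s %s\n%s %.17g\n", name, type, name, value);
    if (n < 0 || (size_t)n >= size - used) {
        return -1; /* encoding error or output truncated */
    }
    return (int)(used + (size_t)n);
}
```

Calling it first for free_heap_bytes (type "gauge") and then for restarts_total (type "counter") produces the same four lines as the example above, minus the empty braces.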

Metric Labels

In the previous example, you have seen the metric names free_heap_bytes and restarts_total. But what are the curly braces {} for? They hold the metric’s labels; in that example, they are empty.

Labels are used to distinguish between different instances of the same metric. For example, if you have two temperature sensors in your device, you can use labels to distinguish between them.

# TYPE temperature_celsius gauge
temperature_celsius{sensor="cpu_die"} 48.5
temperature_celsius{sensor="pcb"} 32.0
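Label values are written inside double quotes, so the text format requires a backslash, a double quote and a line feed in a value to be escaped as \\, \" and \n. A minimal sketch of formatting one labeled sample with that escaping follows; the function name `format_labeled_sample` is hypothetical, and it supports only a single label to stay short.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: writes one sample with a single label, e.g.
 *   temperature_celsius{sensor="pcb"} 32
 * Backslash, double quote and line feed in the label value are escaped.
 * Returns the number of characters written, or -1 if `out` is too small.
 */
int format_labeled_sample(char *out, size_t size, const char *name,
                          const char *label, const char *label_value,
                          double value)
{
    assert(out != NULL && size > 0);
    int n = snprintf(out, size, "%s{%s=\"", name, label);
    if (n < 0 || (size_t)n >= size) return -1;
    size_t pos = (size_t)n;

    for (const char *p = label_value; *p != '\0'; p++) {
        const char *esc = NULL; /* two-character escape sequence, if needed */
        if (*p == '\\') esc = "\\\\";
        else if (*p == '"') esc = "\\\"";
        else if (*p == '\n') esc = "\\n";

        size_t need = esc ? 2 : 1;
        if (pos + need >= size) return -1; /* keep room for the terminator */
        if (esc) {
            out[pos++] = esc[0];
            out[pos++] = esc[1];
        } else {
            out[pos++] = *p;
        }
    }

    /* Close the label set and append the value plus a newline. */
    n = snprintf(out + pos, size - pos, "\"} %.17g\n", value);
    if (n < 0 || (size_t)n >= size - pos) return -1;
    return (int)(pos + (size_t)n);
}
```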

How to implement the metrics endpoint?

You can implement it yourself, but for the ESP32 you can use our component esp_prometheus_exporter, a small open-source library that implements the Prometheus endpoint in the exposition format and provides a set of APIs to define and update metrics. Its README explains how to use it.
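If you do implement it yourself, one detail is easy to miss: all samples of one metric must appear as a single group, preceded once by its "# TYPE" line. The following is a minimal do-it-yourself sketch of a registry that renders a whole metrics page under that rule. It is not the API of esp_prometheus_exporter; the `sample_t` type and `render_registry` function are hypothetical, and label values are assumed to need no escaping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One sample in a hypothetical registry. */
typedef struct {
    const char *name;
    const char *type;        /* "counter" or "gauge" */
    const char *label_name;  /* NULL if the sample has no label */
    const char *label_value; /* assumed to need no escaping */
    double value;
} sample_t;

/*
 * Renders all samples into `out`. Samples of the same metric must be
 * adjacent in the array; a "# TYPE" line is emitted once, before the first
 * sample of each metric. Returns the body length, or -1 if `out` is too small.
 */
int render_registry(char *out, size_t size, const sample_t *s, size_t count)
{
    assert(out != NULL && size > 0);
    size_t pos = 0;
    out[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        int n;
        if (i == 0 || strcmp(s[i].name, s[i - 1].name) != 0) {
            n = snprintf(out + pos, size - pos, "# TYPE %s %s\n",
                         s[i].name, s[i].type);
            if (n < 0 || (size_t)n >= size - pos) return -1;
            pos += (size_t)n;
        }
        if (s[i].label_name != NULL) {
            n = snprintf(out + pos, size - pos, "%s{%s=\"%s\"} %.17g\n",
                         s[i].name, s[i].label_name, s[i].label_value,
                         s[i].value);
        } else {
            n = snprintf(out + pos, size - pos, "%s %.17g\n",
                         s[i].name, s[i].value);
        }
        if (n < 0 || (size_t)n >= size - pos) return -1;
        pos += (size_t)n;
    }
    return (int)pos;
}
```

The resulting text is what the /metrics handler would send as the response body on every scrape.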

Add this component to your project as a git submodule in the components folder.

Setup Prometheus

In this example, we will run Prometheus in a Docker container and start it with Docker Compose.