Prometheus is open-source monitoring software designed to monitor containerized applications in a microservice architecture. Given the increasing complexity of modern infrastructure, it is not easy to track an application’s state, health, and performance, and the failure of one application can affect other applications’ ability to run correctly. Prometheus tracks the condition, performance, and any custom metrics of an application over time, which helps engineers catch problems early and improve the overall reliability of a microservice infrastructure. Engineers can also use Prometheus to monitor monolithic infrastructure.

Before we delve into the details of Prometheus, we should define some important terms:

Metric: A named number whose measurement has a meaning. For example: “cpu_usage_seconds = 1000” is a metric.
Target: Any application (typically containerized) that exposes metrics at the ‘/metrics’ HTTP endpoint in the Prometheus format.
Exporter: A library or piece of code that converts existing metrics into the Prometheus format.
Scrape: Pulling metrics from a target by making an HTTP request.
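To make the terms target and scrape concrete, here is a minimal sketch that uses only the Python standard library. The metric name demo_requests_total and its value are made up for illustration; a real target would normally use a client library rather than a hand-written handler.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical payload in the Prometheus text exposition format.
METRICS_BODY = (
    "# HELP demo_requests_total Requests handled by the demo target\n"
    "# TYPE demo_requests_total counter\n"
    "demo_requests_total 42\n"
)


class MetricsHandler(BaseHTTPRequestHandler):
    """Plays the role of a target: serves metrics at /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRICS_BODY.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def scrape(url):
    """Plays the role of a scrape: one HTTP GET against the target."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def run_demo():
    """Start the target on a free local port, scrape it once, then stop it."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return scrape(f"http://127.0.0.1:{server.server_port}/metrics")
    finally:
        server.shutdown()
        server.server_close()


print(run_demo())
```

Prometheus does the same thing as scrape(), only repeatedly and for every configured target.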

Prometheus Architecture

Prometheus has three major components:

Retrieval: This component is responsible for scraping the metrics from all targets at the configured interval.
Time Series DB: Stores each scraped metric as a time series, that is, a sequence of timestamped values recorded at regular intervals.
HTTP Server: Accepts queries over the time-series data as HTTP requests and returns the results as HTTP responses. The query language used here is PromQL.
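As a rough illustration of the HTTP Server component, the sketch below builds an instant-query URL for the /api/v1/query endpoint and parses a response. The base URL assumes a local Prometheus on port 9090, and SAMPLE_RESPONSE is hand-written in the documented response shape; it is not output captured from a real server.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(response_text):
    """Turn an instant-vector response into (labels, value) pairs."""
    payload = json.loads(response_text)
    if payload["status"] != "success":
        raise ValueError("query failed")
    return [
        (item["metric"], float(item["value"][1]))  # value is [timestamp, "string"]
        for item in payload["data"]["result"]
    ]


# Hand-written sample response, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "function_calls_total", "module": "counter_demo"},
                "value": [1606194600.0, "22"],
            }
        ],
    },
})

print(build_query_url("http://localhost:9090", "function_calls_total"))
print(parse_vector(SAMPLE_RESPONSE))
```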

Prometheus Metric Types

Prometheus defines four core metric types: Counter, Gauge, Histogram, and Summary. Between them they cover most situations. It’s the job of the application to expose its metrics in a predefined format that Prometheus understands, and the provided client libraries make this easy. Currently, libraries exist for popular languages such as Go, Python, Java, and Ruby.

In this blog, we will be using the Python client. However, it is easy to translate these concepts into other languages.

Counter

Any value that only increases over time, such as an HTTP request count or an HTTP error response count, can use a counter. A value that can decrease must never be modeled as a counter. The advantage of a counter is that you can query the rate at which the value increases using the rate() function.

In the example below we count the number of calls to a Python function:

from prometheus_client import start_http_server, Counter
import random
import time

COUNTER = Counter('function_calls', 'number of times the function is called', ['module'])

def process_request(t):
    """A dummy function that takes some time."""
    COUNTER.labels('counter_demo').inc()
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8001)
    while True:
        process_request(random.random())

This code generates a counter metric named function_calls_total (the Python client appends the _total suffix to counter names). We increment function_calls_total every time we call the function process_request. Prometheus metrics have labels to distinguish similar metrics generated by different applications. In the above code, function_calls_total has a label named module with a value equal to counter_demo.

The output of

curl http://localhost:8001/metrics

will be:

# HELP function_calls_total number of times the function is called
# TYPE function_calls_total counter
function_calls_total{module="counter_demo"} 22.0
# HELP function_calls_created number of times the function is called
# TYPE function_calls_created gauge
function_calls_created{module="counter_demo"} 1.6061945420716858e+09
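To give a feel for what rate() computes over a counter such as function_calls_total, here is a simplified sketch in plain Python. It is an approximation under stated assumptions: it handles counter resets, but it skips the extrapolation to the window boundaries that the real PromQL function performs, and the name simple_rate is our own.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    A drop in value is treated as a counter reset (process restart).
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # After a reset the counter restarts from zero, so the whole
            # current value counts as new increase.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Three scrapes, 15 seconds apart, with made-up counter values.
print(simple_rate([(0, 10.0), (15, 22.0), (30, 40.0)]))
```

This is also why a counter must never decrease in normal operation: any drop is interpreted as a restart.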

Gauge

Any value that can both increase and decrease over time, such as CPU usage, memory usage, or processing time, uses a gauge.

In the example below we record the time taken by the latest call to the Python function process_request:

from prometheus_client import start_http_server, Gauge
import random
import time

TIME = Gauge('process_time', 'time taken for each function call', ['module'])

def process_request(t):
    """A dummy function that takes some time."""
    TIME.labels('gauge_demo').set(t)
    time.sleep(t)

if __name__ == '__main__':
    start_http_server(8002)
    while True:
        process_request(random.random())

A gauge supports the set(x), inc(x), and dec(x) methods to set the metric to x, increment it by x, and decrement it by x respectively.

The output of curl http://localhost:8002/metrics will be:

# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="9",patchlevel="0",version="3.9.0"} 1.0
# HELP process_time time taken for each function call
# TYPE process_time gauge
process_time{module="gauge_demo"} 0.4709189918033123
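The example above only uses set(). As a small sketch of inc() and dec(), the code below tracks how many calls are in progress. The metric name requests_in_progress and the helpers handle_request and in_flight are our own choices for illustration, and a private CollectorRegistry is used so the sketch does not touch the default registry.

```python
from prometheus_client import CollectorRegistry, Gauge

# A private registry keeps this sketch separate from the default one.
registry = CollectorRegistry()
IN_PROGRESS = Gauge('requests_in_progress', 'requests currently being handled',
                    ['module'], registry=registry)


def handle_request(work):
    """Run work() while counting it as an in-flight request."""
    gauge = IN_PROGRESS.labels('gauge_demo')
    gauge.inc()      # one more request in flight
    try:
        return work()
    finally:
        gauge.dec()  # request finished, even if work() raised


def in_flight():
    """Read the current gauge value back from the registry."""
    return registry.get_sample_value('requests_in_progress',
                                     {'module': 'gauge_demo'})


print(handle_request(in_flight))  # value seen while the request is running
print(in_flight())                # value after the request has finished
```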

Histogram

A histogram counts observations in predefined buckets. In the Python client, the default upper bounds range from 0.005 to 10, followed by a final +Inf bucket. For example, if you observe the duration of every HTTP request coming to your application, each duration is counted in the buckets it fits into. Buckets are cumulative: each bucket counts every observation less than or equal to its upper bound (the le label). A request that takes ten seconds therefore increments the le="10.0" and le="+Inf" buckets, while one that takes 0.3 seconds increments every bucket from le="0.5" upward. The histogram also exposes the sum of all observed durations and the number of observations.

In the example below we observe the duration of each call to the Python function process_request:

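from prometheus_client import Histogram
import time

# One histogram with a 'module' label; HIST is the child for this module.
h = Histogram('request_duration', 'Description of histogram', ['module'])
HIST = h.labels('histogram_demo')

def process_request(t):
    """A dummy function that takes some time."""
    HIST.observe(t)  # record t in the buckets, the sum, and the count
    time.sleep(t)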

The output of curl http://localhost:8003/metrics will be:

# HELP request_duration Description of histogram
# TYPE request_duration histogram
request_duration_bucket{le="0.005",module="histogram_demo"} 0.0
request_duration_bucket{le="0.01",module="histogram_demo"} 0.0
request_duration_bucket{le="0.025",module="histogram_demo"} 0.0
request_duration_bucket{le="0.05",module="histogram_demo"} 2.0
request_duration_bucket{le="0.075",module="histogram_demo"} 3.0
request_duration_bucket{le="0.1",module="histogram_demo"} 3.0
request_duration_bucket{le="0.25",module="histogram_demo"} 5.0
request_duration_bucket{le="0.5",module="histogram_demo"} 9.0
request_duration_bucket{le="0.75",module="histogram_demo"} 11.0
request_duration_bucket{le="1.0",module="histogram_demo"} 16.0
request_duration_bucket{le="2.5",module="histogram_demo"} 16.0
request_duration_bucket{le="5.0",module="histogram_demo"} 16.0
request_duration_bucket{le="7.5",module="histogram_demo"} 16.0
request_duration_bucket{le="10.0",module="histogram_demo"} 16.0
request_duration_bucket{le="+Inf",module="histogram_demo"} 16.0
request_duration_count{module="histogram_demo"} 16.0
request_duration_sum{module="histogram_demo"} 7.188765686771258
# HELP request_duration_created Description of histogram
# TYPE request_duration_created gauge
request_duration_created{module="histogram_demo"} 1.60620555290144e+09

request_duration_bucket are the cumulative bucket counters, with upper bounds (the le label) ranging from 0.005 to 10, plus +Inf.
request_duration_sum is the sum of durations of each function call.
request_duration_count is the total number of function calls.
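The relationship between the three series can be sketched in plain Python. This is not how the client library is implemented internally; it only reproduces the cumulative counting rule, using the same upper bounds that appear in the output above. The function name summarize and the sample observations are made up.

```python
import math

# Upper bounds matching the le labels in the output above.
DEFAULT_BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5,
                   0.75, 1.0, 2.5, 5.0, 7.5, 10.0, math.inf)


def summarize(observations, buckets=DEFAULT_BUCKETS):
    """Return (bucket_counts, count, total) for a list of observations.

    Each bucket counts every observation less than or equal to its upper
    bound, so the counts never decrease as the bound grows.
    """
    bucket_counts = {
        le: sum(1 for value in observations if value <= le)
        for le in buckets
    }
    return bucket_counts, len(observations), sum(observations)


counts, count, total = summarize([0.03, 0.2, 0.9, 12.0])
for le, n in counts.items():
    print(f'le="{le}" {n}')
print("count", count, "sum", total)
```

Note that the +Inf bucket always equals the count, as it does in the output above (16.0 in both places).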

Summary

In the Python client, a summary is very similar to a histogram, except that it doesn’t store bucket information; it exposes only the sum of the observations and the count of observations.

In the example below we observe the duration of each call to a Python function:

from prometheus_client import Summary
import time

s = Summary('request_processing_seconds', 'Time spent processing request', ['module'])
SUMM = s.labels('pymo')

@SUMM.time()  # observe the wall-clock duration of every call
def process_request(t):
    """A dummy function that takes some time."""
    time.sleep(t)

The output of curl http://localhost:8004/metrics will be:

# HELP request_processing_seconds Time spent processing request
# TYPE request_processing_seconds summary
request_processing_seconds_count{module="pymo"} 20.0
request_processing_seconds_sum{module="pymo"} 8.590708531
# HELP request_processing_seconds_created Time spent processing request
# TYPE request_processing_seconds_created gauge
request_processing_seconds_created{module="pymo"} 1.606206054827252e+09
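Even without buckets, the sum and count are enough to derive an average duration. The sketch below works through the arithmetic with the numbers from the output above; the helper names are ours, and the second pair of values passed to windowed_mean is invented to show the calculation between two scrapes.

```python
def mean_duration(total_seconds, count):
    """Average duration since the process started."""
    return total_seconds / count if count else None


def windowed_mean(earlier, later):
    """Average duration between two scrapes.

    earlier and later are (sum, count) pairs read at two points in time.
    """
    delta_sum = later[0] - earlier[0]
    delta_count = later[1] - earlier[1]
    return delta_sum / delta_count if delta_count > 0 else None


# Values from the output above: 8.590708531 seconds over 20 calls.
print(mean_duration(8.590708531, 20))

# A later, made-up scrape: 5 more calls taking 2.5 seconds in total.
print(windowed_mean((8.590708531, 20), (11.090708531, 25)))
```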

Configuring Prometheus

So far we have seen the different types of metrics and how to generate them. Now let us see how to configure Prometheus to monitor the metrics we have developed. Prometheus uses a YAML file to define the configuration.


prometheus.yaml
global:
  scrape_interval: 15s 
  evaluation_interval: 15s 
  scrape_timeout: 10s 
scrape_configs:
  - job_name: 'prometheus_demo'
    static_configs:
      - targets: ['localhost:8001', 'localhost:8002', 'localhost:8003', 'localhost:8004']

The four primary configuration settings used in most situations are:

scrape_interval defines how often Prometheus scrapes the targets.
evaluation_interval determines how frequently Prometheus evaluates the rules defined in the rules file.
job_name is added as a label to every time series scraped from this config.
targets defines the list of HTTP endpoints to scrape for metrics.
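To show what the job_name and targets settings amount to, the sketch below expands a job into the URLs that get scraped and the job and instance labels attached to each series. It assumes the default scheme (http) and metrics path (/metrics); the function name expand_targets is made up for illustration.

```python
def expand_targets(job_name, targets, scheme="http", metrics_path="/metrics"):
    """List the scrape URL and the automatic labels for each target of a job."""
    return [
        {
            "url": f"{scheme}://{target}{metrics_path}",
            "labels": {"job": job_name, "instance": target},
        }
        for target in targets
    ]


for entry in expand_targets("prometheus_demo",
                            ["localhost:8001", "localhost:8002"]):
    print(entry["url"], entry["labels"])
```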

Starting Prometheus

tar xvfz prometheus-*.tar.gz
cd prometheus-*
./prometheus --config.file=prometheus.yaml

Docker

docker run \
    -p 9090:9090 \
    -v /path/to/prometheus.yaml:/etc/prometheus/prometheus.yml \
    prom/prometheus

Accessing UI

http://localhost:9090/graph

The UI can also plot the time-series data as graphs.

Alerting

If you need an alert when function_calls_total > 100, you need to set up alert rules and Alertmanager. Prometheus uses a separate YAML file to define alert rules.

alertrules.yaml
groups:
- name: example
  rules:
  - alert: ExampleAlertName
    expr: function_calls_total{module="counter_demo"} > 100
    for: 10s
    labels:
      severity: low
    annotations:
      summary: example summary

prometheus.yaml
global:
  scrape_interval: 15s 
  evaluation_interval: 15s 
  scrape_timeout: 10s 
rule_files:
  - alertrules.yaml
scrape_configs:
  - job_name: 'prometheus_demo'
    static_configs:
      - targets: ['localhost:8001', 'localhost:8002', 'localhost:8003', 'localhost:8004']

Once you restart Prometheus with the new configuration, you can see the alerts listed in the Alerts section of the Prometheus UI. These alerts reach your Slack channel or email only if you also set up and configure Alertmanager.
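The for: 10s clause in the rule means the expression must stay true for that long before the alert fires; until then the alert is shown as pending. The following plain-Python sketch simulates that behaviour for a series of rule evaluations. It is a simplified model with made-up sample values, not Prometheus code.

```python
def alert_states(samples, threshold, hold_seconds):
    """Return the alert state after each rule evaluation.

    samples: list of (timestamp_seconds, value) pairs, one per evaluation.
    The alert is 'pending' while the condition holds but has not yet held
    for hold_seconds, 'firing' once it has, and 'inactive' otherwise.
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true
            if timestamp - active_since >= hold_seconds:
                states.append("firing")
            else:
                states.append("pending")
        else:
            active_since = None  # condition broken, start over
            states.append("inactive")
    return states


# Evaluations every 15 seconds (evaluation_interval), threshold 100, for: 10s.
evaluations = [(0, 90), (15, 101), (30, 105), (45, 80)]
print(alert_states(evaluations, threshold=100, hold_seconds=10))
```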