A system you cannot observe is a system you cannot operate. As applications move to distributed architectures with microservices, containers, and cloud-native infrastructure, the need for comprehensive monitoring and observability becomes non-negotiable. When an incident occurs at 2 AM, the difference between a 5-minute resolution and a 5-hour firefight comes down to the quality of your observability stack.
Prometheus and Grafana have emerged as the industry standard open-source combination for metrics-based monitoring. Prometheus handles metrics collection and storage with a powerful query language, while Grafana provides visualization, dashboarding, and alerting across multiple data sources. This guide walks through building a production-grade observability stack from the ground up, including instrumenting applications, writing effective queries, building dashboards, and configuring alerts that actually help.
Monitoring vs. Observability: The Three Pillars
Before diving into tooling, it is important to understand what we are building and why. Monitoring and observability are related but distinct concepts.
Monitoring answers predefined questions: Is the server up? Is CPU usage above 90%? Are requests returning 500 errors? You decide in advance what to measure and set thresholds that trigger alerts. Monitoring tells you when something is wrong.
Observability lets you ask arbitrary questions about your system's behavior without deploying new code. When monitoring tells you something is wrong, observability helps you figure out why. An observable system exposes enough internal state that you can reason about its behavior from the outside.
The three pillars of observability are metrics, logs, and traces.
Metrics are numerical measurements collected over time. CPU usage, request latency, error rates, queue depths -- these are time-series data points that Prometheus excels at collecting and querying. Metrics are cheap to store, fast to query, and ideal for dashboards and alerting.
Logs are timestamped records of discrete events. A log entry might record that user X made a request to endpoint Y, which returned a 500 error with a specific stack trace. Logs provide rich detail but are expensive to store and slow to query at scale. Grafana Loki provides a Prometheus-like experience for log aggregation.
Traces follow a single request as it traverses multiple services in a distributed system. A trace shows that an API request hit the gateway, then the authentication service, then the order service, which called the inventory service and the payment service. Traces reveal latency bottlenecks and failure points across service boundaries. Grafana Tempo handles distributed trace storage and querying.
Together, these three pillars give you complete visibility into your systems. Prometheus and Grafana form the metrics foundation, with Loki and Tempo extending coverage to logs and traces.
Prometheus Architecture and Data Model
Prometheus follows a pull-based architecture. Rather than applications pushing metrics to a central server, Prometheus scrapes HTTP endpoints exposed by your applications and infrastructure at regular intervals. This design has several advantages: applications do not need to know about Prometheus, failed scrapes are immediately visible, and you can run Prometheus locally against production targets for debugging.
The core components are:
- Prometheus server -- scrapes, stores, and queries metrics.
- Client libraries -- instrument your application code.
- Exporters -- expose metrics from third-party systems such as databases and hardware.
- Alertmanager -- handles alert routing, deduplication, and notification.
- Pushgateway -- accepts pushed metrics from short-lived batch jobs that cannot be scraped.
Prometheus stores data as time series, each identified by a metric name and a set of key-value labels:
http_requests_total{method="GET", endpoint="/api/products", status="200"} 14523
http_requests_total{method="POST", endpoint="/api/products", status="201"} 892
http_requests_total{method="GET", endpoint="/api/products", status="500"} 37
Metric types define how values behave:
- Counter -- a monotonically increasing value (total requests, total errors). You derive rates from counters.
- Gauge -- a value that can go up or down (current temperature, active connections, queue depth).
- Histogram -- samples observations and counts them in configurable buckets (request duration distribution).
- Summary -- similar to a histogram but calculates quantiles on the client side. Client-side quantiles cannot be meaningfully aggregated across instances, so histograms are usually the better choice.
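To make the histogram type concrete, here is a small plain-JavaScript sketch -- purely illustrative, not the client library -- of how observations land in cumulative buckets, which is what gives the "le" label its meaning:

```javascript
// Illustrative sketch only: how a Prometheus histogram records observations
// into cumulative buckets. Real client libraries do this internally.
const buckets = [0.1, 0.5, 1, 5, Infinity]; // bucket upper bounds (the "le" label)
const counts = new Array(buckets.length).fill(0);
let sum = 0;
let count = 0;

function observe(value) {
  sum += value;
  count += 1;
  // Buckets are cumulative: every bucket whose bound covers the value is incremented.
  buckets.forEach((bound, i) => {
    if (value <= bound) counts[i] += 1;
  });
}

[0.05, 0.3, 0.7, 2.0].forEach(observe);
console.log(counts); // [ 1, 2, 3, 4, 4 ] -- each bucket includes all smaller observations
```

Because the buckets are cumulative, any quantile can be estimated after the fact, and bucket counts from many instances can simply be summed before the estimate is taken.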
A basic Prometheus configuration for scraping targets:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "api-services"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port, __meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        regex: (.+);(.+)
        replacement: $2:$1

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
The Kubernetes service discovery configuration automatically finds pods annotated with prometheus.io/scrape: "true", which means new services are automatically monitored without updating Prometheus configuration.
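For reference, a pod opts in to scraping with annotations like these. The names follow the common prometheus.io convention that the relabel rules above key on; the image name and port here are illustrative placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-service
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"   # optional; defaults to /metrics
    prometheus.io/port: "8080"
spec:
  containers:
    - name: api-service
      image: example/api-service:1.0  # illustrative image name
      ports:
        - containerPort: 8080
```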
Instrumenting Applications and Custom Metrics
Client libraries let you expose application-specific metrics directly from your code. Here is an example using the Prometheus client library in a Node.js Express application:
const express = require("express");
const promClient = require("prom-client");

// Create a registry
const register = new promClient.Registry();

// Collect default metrics (CPU, memory, event loop lag, etc.)
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

const httpRequestsTotal = new promClient.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"],
  registers: [register],
});

const activeConnections = new promClient.Gauge({
  name: "active_connections",
  help: "Number of active connections",
  registers: [register],
});

// Business metrics
const ordersProcessed = new promClient.Counter({
  name: "orders_processed_total",
  help: "Total number of orders processed",
  labelNames: ["status", "payment_method"],
  registers: [register],
});

// Middleware to track request metrics
function metricsMiddleware(req, res, next) {
  const start = process.hrtime.bigint();
  activeConnections.inc();

  res.on("finish", () => {
    const duration = Number(process.hrtime.bigint() - start) / 1e9;
    const route = req.route ? req.route.path : req.path;

    httpRequestDuration
      .labels(req.method, route, res.statusCode.toString())
      .observe(duration);

    httpRequestsTotal
      .labels(req.method, route, res.statusCode.toString())
      .inc();

    activeConnections.dec();
  });

  next();
}

const app = express();
app.use(metricsMiddleware);

// Metrics endpoint for Prometheus to scrape
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});
For .NET applications, the prometheus-net library provides similar capabilities:
using Prometheus;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Built-in HTTP metrics middleware
app.UseHttpMetrics();

// Custom business metrics
var ordersProcessed = Metrics.CreateCounter(
    "orders_processed_total",
    "Total orders processed",
    new CounterConfiguration
    {
        LabelNames = new[] { "status", "payment_method" }
    });

var orderValue = Metrics.CreateHistogram(
    "order_value_dollars",
    "Order value in dollars",
    new HistogramConfiguration
    {
        Buckets = Histogram.LinearBuckets(10, 50, 20)
    });

// Expose /metrics endpoint
app.MapMetrics();

app.MapPost("/api/orders", (Order order) =>
{
    // Process order...
    ordersProcessed.Labels("completed", order.PaymentMethod).Inc();
    orderValue.Observe((double)order.Total);
    return Results.Created();
});

app.Run();
The key insight with custom metrics is to measure what matters to the business, not just system resources. Request latency percentiles, error rates by endpoint, orders processed per minute, payment failures by provider -- these are the metrics that help you understand whether your system is serving its users well.
PromQL: Querying Metrics Effectively
PromQL is Prometheus's query language, and learning it well is essential for building useful dashboards and alerts. Here are the patterns you will use most often.
Rate of change for counters. Counters only go up, so the raw value is rarely useful. The rate() function calculates the per-second rate of increase over a time window:
# Requests per second over the last 5 minutes
rate(http_requests_total[5m])
# Error rate as a percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
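To build intuition for what rate() computes, here is a simplified plain-JavaScript sketch: the per-second increase of a counter over a window, treating any drop in the value as a counter reset. Real Prometheus additionally extrapolates to the window boundaries, so its results differ slightly:

```javascript
// Simplified sketch of rate(): per-second increase of a counter over a window.
// samples: [{t: seconds, v: counterValue}, ...] in time order.
function simpleRate(samples) {
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter reset; count the post-reset value as increase.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return increase / elapsed;
}

const samples = [
  { t: 0, v: 100 },
  { t: 15, v: 130 },
  { t: 30, v: 10 }, // counter reset (e.g. the process restarted)
  { t: 45, v: 40 },
];
// Increase is 30 + 10 + 30 = 70 over 45 seconds
console.log(simpleRate(samples)); // ≈ 1.556 requests/sec
```

This reset handling is why you should always apply rate() to raw counters rather than computing differences yourself: a process restart would otherwise show up as a huge negative spike.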
Histogram quantiles for latency analysis:
# 95th percentile request duration over the last 10 minutes
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[10m]))
# 99th percentile by endpoint
histogram_quantile(0.99,
  sum by (route, le) (rate(http_request_duration_seconds_bucket[5m]))
)
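Under the hood, histogram_quantile estimates the quantile by finding the bucket where the target rank falls and interpolating linearly within it. A simplified plain-JavaScript sketch, ignoring Prometheus's special-casing of the +Inf bucket:

```javascript
// Simplified sketch of histogram_quantile: estimate quantile q from
// cumulative bucket counts by linear interpolation within the target bucket.
function histogramQuantile(q, bounds, cumCounts) {
  const total = cumCounts[cumCounts.length - 1];
  const rank = q * total; // how many observations fall at or below the quantile
  for (let i = 0; i < bounds.length; i++) {
    if (cumCounts[i] >= rank) {
      const lower = i === 0 ? 0 : bounds[i - 1];
      const prevCount = i === 0 ? 0 : cumCounts[i - 1];
      // Assume observations are spread evenly inside the bucket.
      const fraction = (rank - prevCount) / (cumCounts[i] - prevCount);
      return lower + (bounds[i] - lower) * fraction;
    }
  }
}

// Buckets le=0.1, 0.25, 0.5, 1 with cumulative counts 60, 85, 95, 100:
console.log(histogramQuantile(0.95, [0.1, 0.25, 0.5, 1], [60, 85, 95, 100]));
// prints 0.5
```

The even-spread assumption inside each bucket is why bucket boundaries matter: choose them so that the latencies you care about (such as your SLO threshold) sit on or near a bucket bound.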
Aggregation across labels:
# Total requests per second across all instances, grouped by endpoint
sum by (route) (rate(http_requests_total[5m]))
# Top 5 endpoints by request volume
topk(5, sum by (route) (rate(http_requests_total[5m])))
# Average memory usage across all pods in a service
avg by (service) (container_memory_usage_bytes{namespace="production"})
Alerting expressions that detect real problems:
# High error rate: more than 5% of requests returning 5xx
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) > 0.05
# Latency SLO breach: p99 latency exceeding 1 second
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
# Disk space running low: less than 10% free
(node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
Grafana Dashboards and Alerting
Grafana transforms raw Prometheus metrics into actionable visualizations. A well-designed dashboard gives operators immediate situational awareness and guides them toward root causes during incidents.
Dashboard design principles. Start with the RED method for services (Rate, Errors, Duration) and the USE method for resources (Utilization, Saturation, Errors). The top row of a service dashboard should show request rate, error rate, and latency percentiles. Deeper panels can break down by endpoint, instance, or error type.
Grafana dashboards can be provisioned as code using JSON or YAML files, enabling version control and reproducibility:
# provisioning/dashboards/dashboard.yml
apiVersion: 1

providers:
  - name: "default"
    orgId: 1
    folder: "Services"
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
A dashboard JSON model for a service overview:
{
  "dashboard": {
    "title": "API Service Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"api-service\"}[5m]))",
            "legendFormat": "Total RPS"
          }
        ],
        "gridPos": { "h": 8, "w": 8, "x": 0, "y": 0 }
      },
      {
        "title": "Error Rate (%)",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"api-service\",status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"api-service\"}[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 1 },
                { "color": "red", "value": 5 }
              ]
            }
          }
        },
        "gridPos": { "h": 8, "w": 8, "x": 8, "y": 0 }
      },
      {
        "title": "Latency Percentiles",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m])))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m])))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m])))",
            "legendFormat": "p99"
          }
        ],
        "gridPos": { "h": 8, "w": 8, "x": 16, "y": 0 }
      }
    ]
  }
}
Alerting with Alertmanager routes alerts to the right people through the right channels. Prometheus evaluates alerting rules and sends firing alerts to Alertmanager, which handles grouping, deduplication, silencing, and routing:
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

route:
  receiver: "default-slack"
  group_by: ["alertname", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: "slack-warnings"
      repeat_interval: 1h

receivers:
  - name: "default-slack"
    slack_configs:
      - channel: "#alerts"
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_SERVICE_KEY"
        severity: critical
  - name: "slack-warnings"
    slack_configs:
      - channel: "#alerts-warnings"
Prometheus alerting rules define the conditions and metadata:
# alerts/service-alerts.yml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "{{ $labels.service }} is returning >5% errors for the last 5 minutes. Current value: {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p99 latency on {{ $labels.service }}"
          description: "{{ $labels.service }} p99 latency is above 2 seconds. Current value: {{ $value | humanizeDuration }}"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash-looping"
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been restarting repeatedly."
Extending the Stack: Loki for Logs and Tempo for Traces
A complete observability stack integrates metrics, logs, and traces. Grafana Loki stores and queries logs using the same label-based approach as Prometheus, and Grafana Tempo provides scalable distributed trace storage.
Adding Loki as a Grafana data source lets you correlate logs with metrics. When you see a spike in error rates on a Grafana dashboard, you can click through to see the actual error logs for that time window using LogQL:
{namespace="production", app="api-service"} |= "error" | json | status_code >= 500
Tempo integrates with OpenTelemetry to ingest distributed traces. The real power emerges when you link all three pillars: a Grafana dashboard shows a latency spike, you click a data point to see exemplar traces for that time range, and from a trace you jump to the relevant logs for each span. This workflow turns what would be hours of investigation into minutes.
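One way to wire up those cross-pillar links is through Grafana data source provisioning. The sketch below assumes the service names of a typical Compose setup and uses Grafana's derived-fields option to turn trace IDs in log lines into Tempo links; treat the exact field names, ports, and the trace_id log format as assumptions to verify against your Grafana version:

```yaml
# provisioning/datasources/datasources.yml (illustrative sketch)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090

  - name: Tempo
    uid: tempo
    type: tempo
    url: http://tempo:3200   # Tempo's HTTP query port

  - name: Loki
    uid: loki
    type: loki
    url: http://loki:3100
    jsonData:
      # Turn trace IDs found in log lines into links to Tempo traces
      derivedFields:
        - name: TraceID
          matcherRegex: "trace_id=(\\w+)"
          datasourceUid: tempo
          url: "$${__value.raw}"
```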
Deploying the full stack with Docker Compose for local development:
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts:/etc/prometheus/alerts
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./provisioning:/etc/grafana/provisioning
      - ./dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    ports:
      - "3000:3000"

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  tempo:
    image: grafana/tempo:latest
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
The combination of Prometheus, Grafana, Loki, and Tempo gives you a unified observability platform that rivals expensive commercial solutions. All four tools are open source, horizontally scalable, and designed to work together seamlessly.
Building an effective monitoring and observability practice is as much about culture as it is about tooling. Dashboards that no one looks at and alerts that everyone ignores are worse than useless -- they create a false sense of security. Start with a small number of high-signal dashboards and alerts, iterate based on what you learn from incidents, and make observability a first-class concern in your development process rather than an afterthought.
At Maranatha Technologies, we design and implement monitoring and observability solutions tailored to your infrastructure and team workflows. From initial Prometheus and Grafana setup to building custom dashboards, alerting pipelines, and full-stack observability with Loki and Tempo, our DevOps and infrastructure services ensure your team has the visibility it needs to operate with confidence. Get in touch to discuss your observability strategy.