In this episode, we'll discuss Observability for monitoring, logging, and tracing Kubernetes applications. We'll learn about metrics, logs, traces, and best practices for implementing observability in Kubernetes.

Note
If you want to read the previous episode, you can click the Episode 41 thumbnail below
In the previous episode, we explored External Secret Manager, which provides secure secret management for Kubernetes applications. Now we'll dive into Observability, which enables you to understand what's happening inside your Kubernetes cluster.
Note: Here I'll be using a Kubernetes Cluster installed through K3s.
Observability is the ability to understand the internal state of a system based on its external outputs. In Kubernetes, observability consists of three pillars: metrics, logs, and traces. Think of observability like having X-ray vision for your cluster - you can see what's happening, diagnose problems, and optimize performance.
Observability is different from monitoring. Monitoring tells you when something is wrong. Observability helps you understand why it's wrong.
1. Metrics
Quantitative measurements of system behavior over time.
2. Logs
Detailed records of events that occurred in the system.
3. Traces
Records of requests flowing through the system.
Why does observability matter?
1. Troubleshooting
Quickly identify and fix issues.
2. Performance Optimization
Understand bottlenecks and optimize.
3. Capacity Planning
Plan for future growth.
4. Security
Detect anomalies and security issues.
5. Compliance
Meet audit and compliance requirements.
Metrics are quantitative measurements collected at regular intervals.
Prometheus is the de facto standard for Kubernetes metrics.
Installation
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack

Prometheus Scrape Config
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

# Node metrics
node_cpu_seconds_total
node_memory_MemAvailable_bytes
node_disk_io_time_seconds_total
# Pod metrics
container_cpu_usage_seconds_total
container_memory_usage_bytes
container_network_receive_bytes_total
# Kubernetes metrics
kube_pod_status_phase
kube_deployment_status_replicas
kube_node_status_condition

# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod_name)
# Memory usage per node
sum(container_memory_usage_bytes) by (node)
# Pod restart count
kube_pod_container_status_restarts_total
# Deployment replica mismatch
kube_deployment_status_replicas_desired - kube_deployment_status_replicas_available

Logs provide detailed records of events.
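Plain-text logs are easy to emit but hard to query; structured (JSON) logs can be filtered by field. A minimal sketch using only Python's stdlib logging module (the service name and field set are illustrative, mirroring the log examples in this episode):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "web-app",          # illustrative service name
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("web-app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Request processed")
```

Log collectors like Fluentd can then parse each line as JSON instead of guessing at free-form text.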
# View pod logs
kubectl logs pod-name
# View logs from specific container
kubectl logs pod-name -c container-name
# Stream logs
kubectl logs -f pod-name
# View previous logs (if pod crashed)
kubectl logs pod-name --previous

ELK Stack (Elasticsearch, Logstash, Kibana)
helm repo add elastic https://helm.elastic.co
helm install elasticsearch elastic/elasticsearch
helm install kibana elastic/kibana
helm install logstash elastic/logstash

Fluentd Configuration
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

<match kubernetes.**>
  @type elasticsearch
  @id output_elasticsearch
  @log_level info
  include_tag_key true
  host elasticsearch
  port 9200
  path_prefix logstash
  logstash_format true
  logstash_prefix logstash
  logstash_prefix_separator _
  include_timestamp false
  type_name _doc
</match>

A structured log entry stored this way looks like:

{
  "timestamp": "2026-03-01T10:30:45.123Z",
  "level": "INFO",
  "service": "web-app",
  "pod": "web-app-5d4f7c6b9-abc12",
  "namespace": "production",
  "message": "Request processed",
  "request_id": "req-12345",
  "duration_ms": 145,
  "status_code": 200
}

Traces track requests flowing through the system.
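Before looking at a concrete tracer, the wire format is worth seeing: each service passes the trace ID downstream in a header so spans can be stitched together. A sketch of building a W3C `traceparent` value in plain Python (stdlib only; a real tracer such as Jaeger generates and propagates this for you):

```python
import os
import re

def make_traceparent():
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = os.urandom(16).hex()   # 128-bit trace ID, shared by every span in the request
    span_id = os.urandom(8).hex()     # 64-bit ID of the current span
    return f"00-{trace_id}-{span_id}-01"  # 00 = version, 01 = "sampled" flag

header = make_traceparent()
print(header)
assert re.fullmatch(r"00-[0-9a-f]{32}-[0-9a-f]{16}-01", header)
```

A downstream service keeps the trace ID, generates a fresh span ID, and forwards the header again.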
Jaeger is a distributed tracing platform.
Installation
kubectl create namespace jaeger
kubectl apply -n jaeger -f https://raw.githubusercontent.com/jaegertracing/jaeger-kubernetes/main/jaeger-all-in-one-template.yml

Instrumentation Example
from jaeger_client import Config

def init_tracer(service_name):
    # Constant sampler (param=1) reports every trace; logging surfaces span activity
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
        },
        service_name=service_name,
    )
    return config.initialize_tracer()

tracer = init_tracer('my-service')
with tracer.start_active_span('my-operation') as scope:
    # Do work
    pass

# Propagate trace context across services
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: congo=t61rcZ94W243

Alerting Rules

groups:
  - name: kubernetes.rules
    interval: 30s
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
        for: 5m
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        annotations:
          summary: "Node {{ $labels.node }} is not ready"
      - alert: HighMemoryUsage
        expr: (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9
        for: 5m
        annotations:
          summary: "Pod {{ $labels.pod }} memory usage is above 90%"

A Grafana dashboard definition:

{
  "dashboard": {
    "title": "Kubernetes Cluster",
    "panels": [
      {
        "title": "CPU Usage",
        "targets": [
          { "expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (pod_name)" }
        ]
      },
      {
        "title": "Memory Usage",
        "targets": [
          { "expr": "sum(container_memory_usage_bytes) by (pod_name)" }
        ]
      },
      {
        "title": "Pod Restarts",
        "targets": [
          { "expr": "kube_pod_container_status_restarts_total" }
        ]
      }
    ]
  }
}

# Find errors in production
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "match": { "namespace": "production" } },
        { "range": { "timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}

Problem: No visibility into application behavior.
Solution: Add instrumentation:
from prometheus_client import Counter, Histogram

request_count = Counter('requests_total', 'Total requests')
request_duration = Histogram('request_duration_seconds', 'Request duration')

@request_duration.time()
def handle_request():
    request_count.inc()
    # Handle request

Problem: High storage costs and performance impact.
Solution: Sample strategically:
# Sample 10% of traces
sampler:
  type: probabilistic
  param: 0.1

Problem: Issues go unnoticed.
Solution: Configure meaningful alerts:
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
  for: 5m

Problem: Logs fill up storage.
Solution: Set retention policies:
# Keep logs for 30 days
index.lifecycle.name: logs
index.lifecycle.rollover_alias: logs
index.lifecycle.delete.min_age: 30d

Problem: Can't connect metrics, logs, and traces.
Solution: Use correlation IDs:
# Include request ID in all outputs
request_id: "req-12345"
trace_id: "trace-12345"

The same IDs then appear in every log entry:

{
  "timestamp": "2026-03-01T10:30:45Z",
  "level": "INFO",
  "service": "web-app",
  "request_id": "req-12345",
  "message": "Request processed"
}

Instrument your applications:

from prometheus_client import Counter, Histogram
request_count = Counter('requests_total', 'Total requests')
request_duration = Histogram('request_duration_seconds', 'Request duration')

- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1

# Propagate across services
X-Request-ID: req-12345
X-Trace-ID: trace-12345

# Monitor Prometheus itself
prometheus_tsdb_symbol_table_size_bytes
prometheus_tsdb_wal_corruptions_total

# Keep metrics for 15 days
retention: 15d

Create dashboards for different audiences:
# Metrics Documentation
## request_duration_seconds
- Description: HTTP request duration
- Unit: seconds
- Labels: method, endpoint, status

┌─────────────────────────────────────┐
│          Applications               │
│   (Instrumented with metrics)       │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│        Data Collection              │
│  - Prometheus (metrics)             │
│  - Fluentd (logs)                   │
│  - Jaeger (traces)                  │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│           Storage                   │
│  - Prometheus TSDB                  │
│  - Elasticsearch                    │
│  - Jaeger backend                   │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│    Visualization & Alerting         │
│  - Grafana (dashboards)             │
│  - Kibana (logs)                    │
│  - AlertManager (alerts)            │
└─────────────────────────────────────┘

| Aspect | Monitoring | Observability |
|---|---|---|
| Focus | Known unknowns | Unknown unknowns |
| Approach | Predefined metrics | Exploratory analysis |
| Alerts | Threshold-based | Anomaly-based |
| Debugging | Limited | Comprehensive |
| Cost | Lower | Higher |
In episode 42, we've explored Observability in Kubernetes in depth. We've learned about metrics, logs, traces, and best practices for implementing observability.
Key takeaways:
1. Metrics, logs, and traces together form the three pillars of observability.
2. Prometheus, the ELK stack, and Jaeger are the de facto tools for each pillar.
3. Correlate all three with request and trace IDs, set retention policies, and alert on conditions that persist.
Observability is essential for operating production Kubernetes clusters reliably and efficiently.
Note
If you want to continue to the next episode, you can click the Episode 43 thumbnail below