Learning Kubernetes - Episode 42 - Introduction and Explanation of Observability

In this episode, we'll discuss Observability for monitoring, logging, and tracing Kubernetes applications. We'll learn about metrics, logs, traces, and best practices for implementing observability in Kubernetes.

Arman Dwi Pangestu
April 17, 2026

Introduction

Note

If you want to read the previous episode, you can click the Episode 41 thumbnail below

Episode 41

In the previous episode, we explored External Secret Manager, which provides secure secret management for Kubernetes applications. Now we'll dive into Observability, which enables you to understand what's happening inside your Kubernetes cluster.

Note: Here I'll be using a Kubernetes Cluster installed through K3s.

Observability is the ability to understand the internal state of a system based on its external outputs. In Kubernetes, observability rests on three pillars: metrics, logs, and traces. Think of observability as X-ray vision for your cluster: you can see what's happening, diagnose problems, and optimize performance.

Understanding Observability

Observability is different from monitoring. Monitoring tells you when something is wrong. Observability helps you understand why it's wrong.

The Three Pillars of Observability

1. Metrics

Quantitative measurements of system behavior over time.

2. Logs

Detailed records of events that occurred in the system.

3. Traces

Records of requests flowing through the system.

Why Observability Matters

1. Troubleshooting

Quickly identify and fix issues.

2. Performance Optimization

Understand bottlenecks and optimize.

3. Capacity Planning

Plan for future growth.

4. Security

Detect anomalies and security issues.

5. Compliance

Meet audit and compliance requirements.

Metrics

Metrics are quantitative measurements collected at regular intervals.

Prometheus

Prometheus is the de facto standard for Kubernetes metrics.

Installation

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack

Prometheus Scrape Config

prometheus-config.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
 
scrape_configs:
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
 
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
 
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Common Kubernetes Metrics

Common Metrics
# Node metrics
node_cpu_seconds_total
node_memory_MemAvailable_bytes
node_disk_io_time_seconds_total
 
# Pod metrics
container_cpu_usage_seconds_total
container_memory_usage_bytes
container_network_receive_bytes_total
 
# Kubernetes metrics
kube_pod_status_phase
kube_deployment_status_replicas
kube_node_status_condition

Querying Metrics

PromQL Queries
# CPU usage per pod (cAdvisor uses the `pod` label on Kubernetes 1.16+)
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
 
# Memory usage per node
sum(container_memory_usage_bytes) by (node)
 
# Pod restart count
kube_pod_container_status_restarts_total
 
# Deployment replica mismatch
kube_deployment_status_replicas_desired - kube_deployment_status_replicas_available
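
To see what these queries compute under the hood, here is a minimal Python sketch of what PromQL's rate() does: divide a counter's increase by the width of the time window. The sample values and timestamps are made up for illustration, and counter resets (which rate() compensates for) are not handled.

```python
def per_second_rate(samples):
    """samples: list of (unix_timestamp, counter_value) pairs, oldest first.

    Returns the average per-second increase over the window, which is
    the essence of PromQL's rate().
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    # Counters only ever increase; a drop would mean the process
    # restarted, which this sketch does not handle.
    return (v1 - v0) / (t1 - t0)

# Two hypothetical samples of container_cpu_usage_seconds_total, 300 s (5 m) apart
samples = [(1000, 120.0), (1300, 165.0)]
print(per_second_rate(samples))  # 0.15 CPU-seconds per second = 15% of one core
```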

Logging

Logs provide detailed records of events.

Container Logs

bash
# View pod logs
kubectl logs pod-name
 
# View logs from specific container
kubectl logs pod-name -c container-name
 
# Stream logs
kubectl logs -f pod-name
 
# View previous logs (if pod crashed)
kubectl logs pod-name --previous

Centralized Logging

ELK Stack (Elasticsearch, Logstash, Kibana)

bash
helm repo add elastic https://helm.elastic.co
helm install elasticsearch elastic/elasticsearch
helm install kibana elastic/kibana
helm install logstash elastic/logstash

Fluentd Configuration

fluentd-config.yaml
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>
 
<match kubernetes.**>
  @type elasticsearch
  @id output_elasticsearch
  @log_level info
  include_tag_key true
  host elasticsearch
  port 9200
  path_prefix logstash
  logstash_format true
  logstash_prefix logstash
  logstash_prefix_separator _
  include_timestamp false
  type_name _doc
</match>

Structured Logging

Structured Log Example
{
  "timestamp": "2026-03-01T10:30:45.123Z",
  "level": "INFO",
  "service": "web-app",
  "pod": "web-app-5d4f7c6b9-abc12",
  "namespace": "production",
  "message": "Request processed",
  "request_id": "req-12345",
  "duration_ms": 145,
  "status_code": 200
}
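
Logs in this shape are usually produced by a JSON formatter inside the application rather than assembled by hand. A minimal Python sketch using the standard library's logging module (the service name and the list of extra fields are assumptions for illustration):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "web-app",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Attach optional fields passed via logging's `extra` argument
        for key in ("request_id", "duration_ms", "status_code"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("web-app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Request processed",
            extra={"request_id": "req-12345", "duration_ms": 145, "status_code": 200})
```

Because every line is valid JSON, Fluentd's json parser (shown above) can index each field individually in Elasticsearch.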

Traces

Traces track requests flowing through the system.

Jaeger

Jaeger is a distributed tracing platform.

Installation

bash
kubectl create namespace jaeger
kubectl apply -n jaeger -f https://raw.githubusercontent.com/jaegertracing/jaeger-kubernetes/main/jaeger-all-in-one-template.yml

Instrumentation Example

Python with Jaeger
from jaeger_client import Config
 
def init_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
        },
        service_name=service_name,
    )
    return config.initialize_tracer()
 
tracer = init_tracer('my-service')
 
with tracer.start_active_span('my-operation') as scope:
    # Do work
    pass

Trace Context

Trace Headers
# Propagate trace context across services
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: congo=t61rcZ94W243
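
When a service receives these W3C Trace Context headers, it keeps the trace_id, substitutes its own span_id, and forwards the result to downstream calls. A minimal Python sketch of that propagation step (the child span ID below is a made-up value; real instrumentation libraries like OpenTelemetry handle this automatically):

```python
import re

# W3C traceparent: version-traceid-spanid-flags, all lowercase hex
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Split a traceparent header into its four fields."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        raise ValueError(f"invalid traceparent: {header!r}")
    return m.groupdict()

def outgoing_headers(incoming, child_span_id):
    """Build headers for a downstream call: same trace_id,
    this service's own span_id."""
    parts = parse_traceparent(incoming["traceparent"])
    headers = {
        "traceparent": (f"{parts['version']}-{parts['trace_id']}-"
                        f"{child_span_id}-{parts['flags']}")
    }
    if "tracestate" in incoming:
        headers["tracestate"] = incoming["tracestate"]  # passed through unchanged
    return headers

incoming = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "tracestate": "congo=t61rcZ94W243",
}
print(outgoing_headers(incoming, "b7ad6b7169203331"))
```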

Practical Examples

Prometheus Alert Rules

alert-rules.yaml
groups:
  - name: kubernetes.rules
    interval: 30s
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
        for: 5m
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
 
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        annotations:
          summary: "Node {{ $labels.node }} is not ready"
 
      - alert: HighMemoryUsage
        expr: (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9
        for: 5m
        annotations:
          summary: "Pod {{ $labels.pod }} memory usage is above 90%"

Grafana Dashboard

grafana-dashboard.json
{
  "dashboard": {
    "title": "Kubernetes Cluster",
    "panels": [
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "sum(container_memory_usage_bytes) by (pod)"
          }
        ]
      },
      {
        "title": "Pod Restarts",
        "targets": [
          {
            "expr": "kube_pod_container_status_restarts_total"
          }
        ]
      }
    ]
  }
}

Log Aggregation Query

kibana-query.json
# Find errors in production
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "match": { "namespace": "production" } },
        { "range": { "timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}

Common Mistakes and Pitfalls

Mistake 1: Not Instrumenting Applications

Problem: No visibility into application behavior.

Solution: Add instrumentation:

Correct: Instrumented
from prometheus_client import Counter, Histogram
 
request_count = Counter('requests_total', 'Total requests')
request_duration = Histogram('request_duration_seconds', 'Request duration')
 
@request_duration.time()
def handle_request():
    request_count.inc()
    # Handle request

Mistake 2: Collecting Too Much Data

Problem: High storage costs and performance impact.

Solution: Sample strategically:

yaml
# Sample 10% of traces
sampler:
  type: probabilistic
  param: 0.1

Mistake 3: Not Setting Up Alerts

Problem: Issues go unnoticed.

Solution: Configure meaningful alerts:

yaml
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
  for: 5m

Mistake 4: Ignoring Log Retention

Problem: Logs fill up storage.

Solution: Set retention policies:

yaml
# Attach log indices to an ILM policy named "logs"; the policy's
# delete phase (min_age: 30d) removes indices after 30 days
index.lifecycle.name: logs
index.lifecycle.rollover_alias: logs

Mistake 5: Not Correlating Data

Problem: Can't connect metrics, logs, and traces.

Solution: Use correlation IDs:

yaml
# Include request ID in all outputs
request_id: "req-12345"
trace_id: "trace-12345"
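
One common way to implement this is to capture the correlation ID once at the edge of a request and attach it to every log line automatically, rather than threading it through each function call. A minimal Python sketch using a context variable (the header name and ID format are assumptions for illustration):

```python
import uuid
from contextvars import ContextVar

# Holds the correlation ID of the request currently being handled
request_id_var = ContextVar("request_id", default=None)

def handle_request(headers):
    """Reuse the caller's X-Request-ID if present, else mint a new one,
    so every log line and downstream call carries the same ID."""
    rid = headers.get("X-Request-ID") or f"req-{uuid.uuid4().hex[:8]}"
    request_id_var.set(rid)
    log("Request received")
    return rid

def log(message):
    # Every log line automatically includes the ID from the context
    print(f'{{"request_id": "{request_id_var.get()}", "message": "{message}"}}')

handle_request({"X-Request-ID": "req-12345"})
```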

Best Practices

1. Use Structured Logging

json
{
  "timestamp": "2026-03-01T10:30:45Z",
  "level": "INFO",
  "service": "web-app",
  "request_id": "req-12345",
  "message": "Request processed"
}

2. Instrument Applications

python
from prometheus_client import Counter, Histogram
 
request_count = Counter('requests_total', 'Total requests')
request_duration = Histogram('request_duration_seconds', 'Request duration')

3. Set Up Meaningful Alerts

yaml
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1

4. Use Correlation IDs

HTTP headers
# Propagate across services
X-Request-ID: req-12345
X-Trace-ID: trace-12345

5. Monitor the Monitoring System

Prometheus metrics
# Monitor Prometheus itself
prometheus_tsdb_symbol_table_size_bytes
prometheus_tsdb_wal_corruptions_total

6. Set Appropriate Retention

yaml
# Keep metrics for 15 days
retention: 15d

7. Use Dashboards for Visualization

Create dashboards for different audiences:

  • Operations: System health
  • Developers: Application performance
  • Business: User experience

8. Document Metrics and Alerts

markdown
# Metrics Documentation
 
## request_duration_seconds
- Description: HTTP request duration
- Unit: seconds
- Labels: method, endpoint, status

Observability Stack

Complete Stack

plaintext
┌─────────────────────────────────────┐
│      Applications                   │
│  (Instrumented with metrics)        │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│      Data Collection                │
│  - Prometheus (metrics)             │
│  - Fluentd (logs)                   │
│  - Jaeger (traces)                  │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│      Storage                        │
│  - Prometheus TSDB                  │
│  - Elasticsearch                    │
│  - Jaeger backend                   │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│      Visualization & Alerting       │
│  - Grafana (dashboards)             │
│  - Kibana (logs)                    │
│  - AlertManager (alerts)            │
└─────────────────────────────────────┘

Monitoring vs Observability

Aspect     | Monitoring          | Observability
-----------|---------------------|----------------------
Focus      | Known unknowns      | Unknown unknowns
Approach   | Predefined metrics  | Exploratory analysis
Alerts     | Threshold-based     | Anomaly-based
Debugging  | Limited             | Comprehensive
Cost       | Lower               | Higher

Conclusion

In episode 42, we've explored Observability in Kubernetes in depth. We've learned about metrics, logs, traces, and best practices for implementing observability.

Key takeaways:

  • Observability enables understanding system behavior
  • Three Pillars - Metrics, Logs, Traces
  • Prometheus - Metrics collection and storage
  • Grafana - Metrics visualization
  • ELK Stack - Log aggregation and analysis
  • Jaeger - Distributed tracing
  • Structured Logging - JSON formatted logs
  • Instrumentation - Add metrics to applications
  • Correlation IDs - Connect related events
  • Alerts - Notify on anomalies
  • Dashboards - Visualize system state
  • Retention Policies - Manage storage
  • Monitoring the Monitor - Ensure observability system health
  • Document Metrics - Help teams understand data
  • Correlate Data - Connect metrics, logs, traces

Observability is essential for operating production Kubernetes clusters reliably and efficiently.

Note

If you want to continue to the next episode, you can click the Episode 43 thumbnail below

Episode 43
