Belajar Kubernetes - Episode 42 - Introduction to Observability

In this episode we will cover Observability for monitoring, logging, and tracing Kubernetes applications. We will learn about metrics, logs, traces, and best practices for implementing observability in Kubernetes.

Arman Dwi Pangestu
April 17, 2026

Introduction

Note

If you want to read the previous episode, you can click the Episode 41 thumbnail below.

Episode 41

In the previous episode, we explored External Secret Manager, which provides secure secret management for Kubernetes applications. Now we will dive into Observability, which enables you to understand what is happening inside your Kubernetes cluster.

Note: Here I will be using a Kubernetes cluster installed via K3s.

Observability is the ability to understand the internal state of a system based on its external outputs. In Kubernetes, observability consists of three pillars: metrics, logs, and traces. Think of observability as having X-ray vision for your cluster: you can see what is happening, diagnose problems, and optimize performance.

Understanding Observability

Observability differs from monitoring. Monitoring tells you when something is wrong; observability helps you understand why it is wrong.

The Three Pillars of Observability

1. Metrics

Quantitative measurements of system behavior over time.

2. Logs

Detailed records of events that occur in the system.

3. Traces

Records of requests as they flow through the system.

Why Observability Matters

1. Troubleshooting

Quickly identify and fix issues.

2. Performance Optimization

Understand bottlenecks and optimize.

3. Capacity Planning

Plan for future growth.

4. Security

Detect anomalies and security issues.

5. Compliance

Meet audit and compliance requirements.

Metrics

Metrics are quantitative measurements collected at regular intervals.

Prometheus

Prometheus is the de facto standard for Kubernetes metrics.

Installation

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack

Prometheus Scrape Config

prometheus-config.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
 
scrape_configs:
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
 
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
 
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
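
The `keep` relabel rule above only scrapes pods that explicitly opt in via an annotation. A minimal sketch of a pod that would be picked up (the name, image, and port are illustrative, not from the original config):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-app                    # illustrative name
  annotations:
    prometheus.io/scrape: "true"   # matched by the relabel rule above
spec:
  containers:
    - name: web-app
      image: my-app:latest         # illustrative image
      ports:
        - containerPort: 8080      # port where /metrics is served
```

Annotations such as `prometheus.io/port` and `prometheus.io/path` are also common, but each one only has an effect if the scrape config contains a corresponding relabel rule for it.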

Common Kubernetes Metrics

Common Metrics
# Node metrics
node_cpu_seconds_total
node_memory_MemAvailable_bytes
node_disk_io_time_seconds_total
 
# Pod metrics
container_cpu_usage_seconds_total
container_memory_usage_bytes
container_network_receive_bytes_total
 
# Kubernetes metrics
kube_pod_status_phase
kube_deployment_status_replicas
kube_node_status_condition

Querying Metrics

PromQL Queries
# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
 
# Memory usage per node
sum(container_memory_usage_bytes) by (node)
 
# Pod restart count
kube_pod_container_status_restarts_total
 
# Deployment replica mismatch
kube_deployment_spec_replicas - kube_deployment_status_replicas_available

Logging

Logs provide detailed records of events.

Container Logs

bash
# View pod logs
kubectl logs pod-name
 
# View logs from a specific container
kubectl logs pod-name -c container-name
 
# Stream logs
kubectl logs -f pod-name
 
# View previous logs (if the pod crashed)
kubectl logs pod-name --previous

Centralized Logging

ELK Stack (Elasticsearch, Logstash, Kibana)

bash
helm repo add elastic https://helm.elastic.co
helm install elasticsearch elastic/elasticsearch
helm install kibana elastic/kibana
helm install logstash elastic/logstash

Fluentd Configuration

fluentd-config.yaml
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>
 
<match kubernetes.**>
  @type elasticsearch
  @id output_elasticsearch
  @log_level info
  include_tag_key true
  host elasticsearch
  port 9200
  path_prefix logstash
  logstash_format true
  logstash_prefix logstash
  logstash_prefix_separator _
  include_timestamp false
  type_name _doc
</match>

Structured Logging

Structured Log Example
{
  "timestamp": "2026-03-01T10:30:45.123Z",
  "level": "INFO",
  "service": "web-app",
  "pod": "web-app-5d4f7c6b9-abc12",
  "namespace": "production",
  "message": "Request processed",
  "request_id": "req-12345",
  "duration_ms": 145,
  "status_code": 200
}
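
As a minimal sketch, a log line like the one above can be produced with nothing more than the standard library (`log_event` is an illustrative helper, not part of any logging framework):

```python
import json
from datetime import datetime, timezone

def log_event(level, service, message, **fields):
    """Emit one structured (JSON) log line to stdout."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "message": message,
        **fields,  # extra context, e.g. request_id, duration_ms
    }
    print(json.dumps(record))
    return record

log_event("INFO", "web-app", "Request processed",
          request_id="req-12345", duration_ms=145, status_code=200)
```

In a real service you would route this through the `logging` module so level filtering still works; the point is simply that every line is machine-parseable JSON that a collector like Fluentd can index field by field.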

Traces

Traces track requests as they flow through the system.

Jaeger

Jaeger is a distributed tracing platform.

Installation

bash
kubectl create namespace jaeger
kubectl apply -n jaeger -f https://raw.githubusercontent.com/jaegertracing/jaeger-kubernetes/main/jaeger-all-in-one-template.yml

Instrumentation Example

Python with Jaeger
from jaeger_client import Config
 
def init_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
        },
        service_name=service_name,
    )
    return config.initialize_tracer()
 
tracer = init_tracer('my-service')
 
with tracer.start_active_span('my-operation') as scope:
    # Do work
    pass

Trace Context

Trace Headers
# Propagate trace context across services
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: congo=t61rcZ94W243
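
The `traceparent` value follows the W3C Trace Context format, `version-traceid-parentid-flags`, with fixed-width lowercase hex fields. A small sketch of splitting it apart (`parse_traceparent` is an illustrative helper):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    # Field widths are fixed by the spec: 2, 32, 16, and 2 hex chars.
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError("malformed traceparent header")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])  # the trace ID shared by every service in the request
```

Every service that forwards the header keeps the same `trace_id` and substitutes its own span ID as `parent_id`, which is what lets Jaeger stitch the spans back into one trace.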

Practical Examples

Prometheus Alert Rules

alert-rules.yaml
groups:
  - name: kubernetes.rules
    interval: 30s
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
        for: 5m
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
 
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        annotations:
          summary: "Node {{ $labels.node }} is not ready"
 
      - alert: HighMemoryUsage
        expr: (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9
        for: 5m
        annotations:
          summary: "Pod {{ $labels.pod }} memory usage is above 90%"

Grafana Dashboard

grafana-dashboard.json
{
  "dashboard": {
    "title": "Kubernetes Cluster",
    "panels": [
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "sum(container_memory_usage_bytes) by (pod)"
          }
        ]
      },
      {
        "title": "Pod Restarts",
        "targets": [
          {
            "expr": "kube_pod_container_status_restarts_total"
          }
        ]
      }
    ]
  }
}

Log Aggregation Query

Kibana Query
# Find errors in production
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "match": { "namespace": "production" } },
        { "range": { "timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}

Common Mistakes and Pitfalls

Mistake 1: Not Instrumenting the Application

Problem: No visibility into application behavior.

Solution: Add instrumentation:

Correct: Instrumented
from prometheus_client import Counter, Histogram
 
request_count = Counter('requests_total', 'Total requests')
request_duration = Histogram('request_duration_seconds', 'Request duration')
 
@request_duration.time()
def handle_request():
    request_count.inc()
    # Handle request

Mistake 2: Collecting Too Much Data

Problem: High storage costs and performance impact.

Solution: Sample strategically:

yaml
# Sample 10% of traces
sampler:
  type: probabilistic
  param: 0.1

Mistake 3: Not Setting Up Alerts

Problem: Issues go unnoticed.

Solution: Configure meaningful alerts:

yaml
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
  for: 5m

Mistake 4: Ignoring Log Retention

Problem: Logs fill up storage.

Solution: Set a retention policy:

yaml
# Keep logs for 30 days by attaching an ILM policy to the index
index.lifecycle.name: logs
index.lifecycle.rollover_alias: logs
# (the 30-day delete phase itself is defined in the "logs" ILM policy,
# not in the index settings)

Mistake 5: Not Correlating Data

Problem: Can't connect metrics, logs, and traces.

Solution: Use correlation IDs:

yaml
# Include the request ID in all output
request_id: "req-12345"
trace_id: "trace-12345"
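
One way to make sure the same ID reaches every log line of a request is a context variable set once at the start of the request. This is a standard-library sketch (`request_id_var` and `log` are illustrative names, not from any framework):

```python
import contextvars
import json

# Holds the current request's correlation ID for the life of the request.
request_id_var = contextvars.ContextVar("request_id", default=None)

def log(message, **fields):
    """Emit a JSON log line that automatically carries the request ID."""
    record = {"message": message,
              "request_id": request_id_var.get(),
              **fields}
    print(json.dumps(record))
    return record

# At the start of a request (e.g. in middleware):
request_id_var.set("req-12345")

# Anywhere deeper in the call stack:
log("payment processed", trace_id="trace-12345")
```

Because `contextvars` is async-aware, concurrent requests in the same process each see their own ID, so every metric label, log field, and trace tag can share one correlation key.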

Best Practices

1. Use Structured Logging

json
{
  "timestamp": "2026-03-01T10:30:45Z",
  "level": "INFO",
  "service": "web-app",
  "request_id": "req-12345",
  "message": "Request processed"
}

2. Instrument the Application

python
from prometheus_client import Counter, Histogram
 
request_count = Counter('requests_total', 'Total requests')
request_duration = Histogram('request_duration_seconds', 'Request duration')

3. Set Up Meaningful Alerts

yaml
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1

4. Use Correlation IDs

yaml
# Propagate across services
X-Request-ID: req-12345
X-Trace-ID: trace-12345

5. Monitor the Monitoring System

bash
# Monitor Prometheus itself
prometheus_tsdb_symbol_table_size_bytes
prometheus_tsdb_wal_corruptions_total

6. Set Appropriate Retention

yaml
# Keep metrics for 15 days
retention: 15d

7. Use Dashboards for Visualization

Create dashboards for different audiences:

  • Operations: System health
  • Developers: Application performance
  • Business: User experience

8. Document Metrics and Alerts

markdown
# Metrics Documentation
 
## request_duration_seconds
- Description: HTTP request duration
- Unit: seconds
- Labels: method, endpoint, status

Observability Stack

Complete Stack

plaintext
┌─────────────────────────────────────┐
│      Application                    │
│  (Instrumented with metrics)        │
└──────────────┬──────────────────────┘

┌──────────────▼──────────────────────┐
│      Data Collection                │
│  - Prometheus (metrics)             │
│  - Fluentd (logs)                   │
│  - Jaeger (traces)                  │
└──────────────┬──────────────────────┘

┌──────────────▼──────────────────────┐
│      Storage                        │
│  - Prometheus TSDB                  │
│  - Elasticsearch                    │
│  - Jaeger backend                   │
└──────────────┬──────────────────────┘

┌──────────────▼──────────────────────┐
│      Visualization & Alerting       │
│  - Grafana (dashboards)             │
│  - Kibana (logs)                    │
│  - AlertManager (alerts)            │
└─────────────────────────────────────┘

Monitoring vs Observability

Aspect     | Monitoring          | Observability
-----------|---------------------|----------------------
Focus      | Known unknowns      | Unknown unknowns
Approach   | Predefined metrics  | Exploratory analysis
Alerts     | Threshold-based     | Anomaly-based
Debugging  | Limited             | Comprehensive
Cost       | Lower               | Higher

Conclusion

In this episode 42, we have covered Observability in Kubernetes in depth. We learned about metrics, logs, traces, and best practices for implementing observability.

Key takeaways:

  • Observability - Enables understanding system behavior from its outputs
  • Three Pillars - Metrics, Logs, Traces
  • Prometheus - Metrics collection and storage
  • Grafana - Metrics visualization
  • ELK Stack - Log aggregation and analysis
  • Jaeger - Distributed tracing
  • Structured Logging - JSON-formatted logs
  • Instrumentation - Add metrics to applications
  • Correlation IDs - Connect related events
  • Alerts - Notify on anomalies
  • Dashboards - Visualize system state
  • Retention Policy - Manage storage
  • Monitor the Monitor - Ensure observability system health
  • Document Metrics - Help the team understand the data
  • Correlate Data - Connect metrics, logs, and traces

Observability is essential for operating production Kubernetes clusters securely and efficiently.

Note

If you want to continue to the next episode, you can click the Episode 43 thumbnail below.

Episode 43
