In this episode we will cover Observability for monitoring, logging, and tracing Kubernetes applications. We will learn about metrics, logs, traces, and best practices for implementing observability in Kubernetes.

Note

If you want to read the previous episode, click the episode 41 thumbnail below

In the previous episode, we explored External Secret Manager, which provides secure secret management for Kubernetes applications. Now we will dive into Observability, which enables you to understand what is happening inside your Kubernetes cluster.

Note: Here I will be using a Kubernetes cluster installed via K3s.

Observability is the ability to understand the internal state of a system from its external outputs. In Kubernetes, observability rests on three pillars: metrics, logs, and traces. Think of observability as X-ray vision for your cluster - you can see what is going on, diagnose problems, and optimize performance.

Observability is not the same as monitoring. Monitoring tells you when something is wrong. Observability helps you understand why it is wrong.
1. Metrics
Quantitative measurements of system behavior over time.
2. Logs
Detailed records of events occurring in the system.
3. Traces
Records of requests as they flow through the system.
Why does observability matter?
1. Troubleshooting
Quickly identify and fix issues.
2. Performance Optimization
Understand bottlenecks and optimize.
3. Capacity Planning
Plan for future growth.
4. Security
Detect anomalies and security issues.
5. Compliance
Meet audit and compliance requirements.
Metrics are quantitative measurements collected at regular intervals.
Prometheus is the de facto standard for Kubernetes metrics.
Installation
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack

Prometheus Scrape Config
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

# Node metrics
node_cpu_seconds_total
node_memory_MemAvailable_bytes
node_disk_io_time_seconds_total
# Pod metrics
container_cpu_usage_seconds_total
container_memory_usage_bytes
container_network_receive_bytes_total
# Kubernetes metrics
kube_pod_status_phase
kube_deployment_status_replicas
kube_node_status_condition

# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
# Memory usage per node
sum(container_memory_usage_bytes) by (node)
# Pod restart count
kube_pod_container_status_restarts_total
# Deployment replica mismatch
kube_deployment_status_replicas_desired - kube_deployment_status_replicas_available

Logs provide a detailed record of events.
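Emitting logs as JSON makes them much easier to filter later in a log store; a minimal sketch of a JSON log formatter using only Python's standard library (the service name and extra fields here are illustrative, mirroring the structured log entry shown further below):

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "web-app",  # placeholder service name
            "message": record.getMessage(),
        }
        # Extra fields (e.g. a correlation ID) can be attached via `extra=`
        if hasattr(record, "request_id"):
            entry["request_id"] = record.request_id
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("web-app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Request processed", extra={"request_id": "req-12345"})
```

For day-to-day inspection, though, the basic tool is kubectl: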
# View pod logs
kubectl logs pod-name

# View logs from a specific container
kubectl logs pod-name -c container-name

# Stream logs
kubectl logs -f pod-name

# View previous logs (if the pod crashed)
kubectl logs pod-name --previous

ELK Stack (Elasticsearch, Logstash, Kibana)
helm repo add elastic https://helm.elastic.co
helm install elasticsearch elastic/elasticsearch
helm install kibana elastic/kibana
helm install logstash elastic/logstash

Fluentd Configuration
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

<match kubernetes.**>
  @type elasticsearch
  @id output_elasticsearch
  @log_level info
  include_tag_key true
  host elasticsearch
  port 9200
  path_prefix logstash
  logstash_format true
  logstash_prefix logstash
  logstash_prefix_separator _
  include_timestamp false
  type_name _doc
</match>

An example structured log entry:

{
  "timestamp": "2026-03-01T10:30:45.123Z",
  "level": "INFO",
  "service": "web-app",
  "pod": "web-app-5d4f7c6b9-abc12",
  "namespace": "production",
  "message": "Request processed",
  "request_id": "req-12345",
  "duration_ms": 145,
  "status_code": 200
}

Traces track requests as they flow through the system.
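Every span in a trace shares a trace ID, which services pass along in the W3C `traceparent` header (the header itself appears in the propagation example further below); a minimal standard-library sketch of generating and parsing such a header:

```python
import re
import secrets

def make_traceparent():
    """Build a W3C traceparent header: version-trace_id-parent_id-flags."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex chars, shared by all spans
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex chars, unique per span
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled

def parse_traceparent(header):
    """Split a traceparent header into its fields (simplified: flags treated as a whole)."""
    m = re.fullmatch(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if m is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = m.groups()
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": flags == "01"}
```

In real services this generation and parsing is handled by the tracing client library, as in the Jaeger example below.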
Jaeger is a distributed tracing platform.
Installation
kubectl create namespace jaeger
kubectl apply -n jaeger -f https://raw.githubusercontent.com/jaegertracing/jaeger-kubernetes/main/jaeger-all-in-one-template.yml

Instrumentation Example
from jaeger_client import Config

def init_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
        },
        service_name=service_name,
    )
    return config.initialize_tracer()

tracer = init_tracer('my-service')

with tracer.start_active_span('my-operation') as scope:
    # Do work
    pass

# Propagate trace context across services
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: congo=t61rcZ94W243

groups:
  - name: kubernetes.rules
    interval: 30s
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
        for: 5m
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        annotations:
          summary: "Node {{ $labels.node }} is not ready"
      - alert: HighMemoryUsage
        expr: (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9
        for: 5m
        annotations:
          summary: "Pod {{ $labels.pod }} memory usage is above 90%"

{
"dashboard": {
"title": "Kubernetes Cluster",
"panels": [
{
"title": "CPU Usage",
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (pod_name)"
}
]
},
{
"title": "Memory Usage",
"targets": [
{
"expr": "sum(container_memory_usage_bytes) by (pod_name)"
}
]
},
{
"title": "Pod Restarts",
"targets": [
{
"expr": "kube_pod_container_status_restarts_total"
}
]
}
]
}
}# Find errors di production
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "match": { "namespace": "production" } },
        { "range": { "timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}

Problem: No visibility into application behavior.
Solution: Add instrumentation:
from prometheus_client import Counter, Histogram

request_count = Counter('requests_total', 'Total requests')
request_duration = Histogram('request_duration_seconds', 'Request duration')

@request_duration.time()
def handle_request():
    request_count.inc()
    # Handle request

Problem: High storage costs and performance impact.
Solution: Sample strategically:
# Sample 10% of traces
sampler:
  type: probabilistic
  param: 0.1

Problem: Issues go unnoticed.
Solution: Configure meaningful alerts:
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
  for: 5m

Problem: Logs fill up storage.
Solution: Set retention policies:

# Keep logs for 30 days
index.lifecycle.name: logs
index.lifecycle.rollover_alias: logs
index.lifecycle.delete.min_age: 30d

Problem: Can't correlate metrics, logs, and traces.
Solution: Use correlation IDs:

# Include the request ID in all outputs
request_id: "req-12345"
trace_id: "trace-12345"

{
"timestamp": "2026-03-01T10:30:45Z",
"level": "INFO",
"service": "web-app",
"request_id": "req-12345",
"message": "Request processed"
}from prometheus_client import Counter, Histogram
request_count = Counter('requests_total', 'Total requests')
request_duration = Histogram('request_duration_seconds', 'Request duration')- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1# Propagate across services
X-Request-ID: req-12345
X-Trace-ID: trace-12345

# Monitor Prometheus itself
prometheus_tsdb_symbol_table_size_bytes
prometheus_tsdb_wal_corruptions_total

# Keep metrics for 15 days
retention: 15d

Create dashboards for different audiences:
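One way to keep per-audience dashboards consistent is to generate their JSON from code; a small sketch that emits the same simplified panel structure used in the dashboard example above (a real Grafana dashboard schema has many more fields, and the titles and queries here are illustrative):

```python
import json

def make_dashboard(title, panels):
    """Build a minimal Grafana-style dashboard dict from (title, PromQL) pairs."""
    return {
        "dashboard": {
            "title": title,
            "panels": [
                {"title": panel_title, "targets": [{"expr": expr}]}
                for panel_title, expr in panels
            ],
        }
    }

# A developer-facing and an ops-facing dashboard from the same helper
dev = make_dashboard("App Health", [
    ("Pod Restarts", "kube_pod_container_status_restarts_total"),
])
ops = make_dashboard("Kubernetes Cluster", [
    ("CPU Usage", "sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)"),
    ("Memory Usage", "sum(container_memory_usage_bytes) by (pod)"),
])
print(json.dumps(ops, indent=2))
```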
# Metrics Documentation
## request_duration_seconds
- Description: HTTP request duration
- Unit: seconds
- Labels: method, endpoint, status

┌─────────────────────────────────────┐
│          Application                │
│  (Instrumented with metrics)        │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│        Data Collection              │
│  - Prometheus (metrics)             │
│  - Fluentd (logs)                   │
│  - Jaeger (traces)                  │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│           Storage                   │
│  - Prometheus TSDB                  │
│  - Elasticsearch                    │
│  - Jaeger backend                   │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│    Visualization & Alerting         │
│  - Grafana (dashboards)             │
│  - Kibana (logs)                    │
│  - AlertManager (alerts)            │
└─────────────────────────────────────┘

| Aspect | Monitoring | Observability |
|---|---|---|
| Focus | Known unknowns | Unknown unknowns |
| Approach | Predefined metrics | Exploratory analysis |
| Alerts | Threshold-based | Anomaly-based |
| Debugging | Limited | Comprehensive |
| Cost | Lower | Higher |
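In practice, the bridge between the pillars is a shared correlation ID stamped onto log lines and outgoing headers alike; a minimal standard-library sketch (header and field names are illustrative):

```python
import json
import uuid

def new_request_context():
    """Create one request ID and reuse it across logs and outgoing headers."""
    request_id = f"req-{uuid.uuid4().hex[:12]}"
    headers = {"X-Request-ID": request_id}  # propagate to downstream services
    log_line = json.dumps({
        "level": "INFO",
        "request_id": request_id,
        "message": "Request processed",
    })
    return request_id, headers, log_line
```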
In this episode 42, we have covered Observability in Kubernetes in depth. We learned about metrics, logs, traces, and best practices for implementing observability.
Key takeaway:
Observability is essential for operating production Kubernetes clusters securely and efficiently.
Note
If you want to continue to the next episode, click the episode 43 thumbnail below