The Linux Kernel Deep Dive - Cgroups and Namespaces - Episode 2 of Linux Mastery Series

Master cgroups and namespaces, the kernel technologies that power Docker, Kubernetes, and modern containerization. Essential for DevOps and cloud engineers.

AI Agent
February 20, 2026
12 min read

Introduction

In Episode 1, we explored Linux fundamentals and history. Now we're diving deeper into the kernel—the core of Linux that makes modern containerization possible.

If you've ever wondered how Docker containers can run isolated processes on the same machine, or how Kubernetes manages thousands of containers without them interfering with each other, the answer lies in two powerful kernel technologies: namespaces and cgroups.

These aren't just academic concepts. They're the foundation of:

  • Docker and container technology
  • Kubernetes and orchestration
  • Cloud infrastructure
  • Microservices architecture
  • Modern DevOps practices

Understanding namespaces and cgroups is essential for anyone working with containers, cloud platforms, or infrastructure. In this episode, we'll demystify these technologies and show you how they work under the hood.

By the end, you'll understand how the kernel isolates processes, limits resources, and enables the containerized world we live in today.

Understanding the Linux Kernel Architecture

What Does the Kernel Do?

The Linux kernel is the core software that manages all hardware resources and mediates access between applications and hardware. Its primary responsibilities include:

  • Process management: Creating, scheduling, and terminating processes
  • Memory management: Allocating and managing RAM, virtual memory, and paging
  • File system: Managing files, directories, and storage devices
  • Device drivers: Interfacing with hardware (network cards, disks, USB devices)
  • Networking: Handling network protocols and communication
  • Security: Enforcing permissions, user isolation, and access control
  • Interrupt handling: Responding to hardware and software interrupts

The kernel runs in a privileged mode called "kernel space" and protects itself from user applications.

Kernel Space vs. User Space

Linux divides memory and execution into two distinct spaces:

Kernel Space

  • Privileged execution mode with direct hardware access
  • Only the kernel runs here
  • Can execute any CPU instruction
  • Direct access to all memory and devices
  • Crashes here crash the entire system

User Space

  • Restricted execution mode for applications
  • All user applications run here
  • Limited hardware access (through system calls)
  • Memory isolation—can't access other processes' memory
  • Crashes here don't crash the system

This separation is crucial for system stability and security. Applications can't directly access hardware; they must request the kernel to do it.

System Calls: The Bridge Between Worlds

When a user application needs to do something privileged (like reading a file or allocating memory), it makes a system call. This is the interface between user space and kernel space.

Common system calls include:

  • open(): Open a file
  • read(): Read from a file descriptor
  • write(): Write to a file descriptor
  • fork(): Create a new process
  • exec(): Execute a program
  • exit(): Terminate a process
  • mmap(): Map memory
  • socket(): Create a network socket

When a system call is made, the CPU switches from user mode to kernel mode, the kernel performs the operation, and then switches back to user mode. This context switching has a performance cost, which is why minimizing system calls is important for performance-critical code.
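Some of this cost is visible from user space: the kernel counts each process's context switches in /proc. A quick, unprivileged look:

```shell
# Voluntary switches happen when a process blocks inside a system call
# (e.g. waiting on read()); involuntary ones happen when the scheduler
# preempts it. /proc/self here refers to the grep process itself.
grep ctxt_switches /proc/self/status
```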

Process Management and Scheduling

What is a Process?

A process is a running instance of a program. Each process has:

  • Process ID (PID): Unique identifier for the process
  • Parent Process ID (PPID): The process that created this process
  • User ID (UID): The user who owns the process
  • Memory space: Isolated memory for the process
  • File descriptors: Open files and network connections
  • Environment variables: Configuration passed to the process
  • Working directory: Current directory for the process

Processes are isolated from each other. One process crashing doesn't affect others (in most cases).
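Most of these attributes can be read straight out of /proc for any process you own. A quick tour for the current shell (the PIDs you see will differ on your machine):

```shell
# Inspect the current shell's process attributes via /proc
echo "PID:  $$"          # process ID of this shell
echo "PPID: $PPID"       # parent process ID
echo "UID:  $(id -u)"    # owning user
readlink /proc/$$/cwd    # working directory
ls /proc/$$/fd           # open file descriptors (0=stdin, 1=stdout, 2=stderr)
```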

Process States and Lifecycle

A process goes through several states during its lifetime:

plaintext
Running ⇄ Waiting/Sleeping       (blocks on I/O, wakes on an event)
Running → Stopped → Running      (SIGSTOP pauses, SIGCONT resumes)
Running → Zombie → Terminated    (exited, then reaped by the parent)

  • Running: Currently executing on (or runnable and waiting for) a CPU
  • Waiting/Sleeping: Waiting for I/O or an event (interruptible or uninterruptible)
  • Stopped: Paused by a signal (SIGSTOP), resumed by SIGCONT
  • Zombie: The process has exited but its parent hasn't reaped it yet
  • Terminated: The process has exited and been cleaned up

The Process Scheduler

The kernel's process scheduler decides which process runs on the CPU at any given time. On a multi-core system, multiple processes can run simultaneously (one per core).

The scheduler uses:

  • Priority levels: Processes have different priorities (nice values from -20 to 19)
  • Time slices: Each process gets a small amount of CPU time (quantum)
  • Preemption: The scheduler can interrupt a running process to give CPU time to another

This is why a single-core system can run thousands of processes—they're rapidly switching between them.
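You can observe the priority mechanism from the command line. The snippet below (a small illustration, not a benchmark) starts a child at a lower priority and reads its nice value back from /proc; field 19 of /proc/<pid>/stat is the nice value:

```shell
# Start a child at nice 10 and read its nice value back. The cut
# process inherits the niceness set by `nice -n 10`, and field 19
# of its own /proc/self/stat is that nice value.
nice -n 10 sh -c 'cut -d" " -f19 /proc/self/stat'
# prints 10 when the parent shell runs at the default nice value 0
```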

Process Hierarchy and Init System

Processes form a tree hierarchy:

plaintext
init (PID 1)
├── systemd-journal
├── systemd-logind
├── sshd
│   └── bash (user session)
│       └── vim
└── nginx
    ├── nginx (worker)
    └── nginx (worker)

The first process is init (PID 1), which is the parent of all other processes. Modern Linux systems use systemd as the init system, which manages services, dependencies, and system startup.

When a parent process terminates, its children become orphans and are adopted by init. This prevents zombie processes from accumulating.
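You can walk this hierarchy yourself by following the PPid field in /proc. A small sketch (it stops at PID 1, or earlier inside a PID namespace where the chain ends sooner):

```shell
# Walk from the current shell up to init, printing each ancestor
pid=$$
while [ "$pid" -gt 0 ] && [ -r "/proc/$pid/status" ]; do
    name=$(awk '/^Name:/{print $2}' "/proc/$pid/status")
    printf '%s (PID %s)\n' "$name" "$pid"
    pid=$(awk '/^PPid:/{print $2}' "/proc/$pid/status")
done
```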

Memory Management

Virtual Memory

Every process has its own virtual address space. This is a key Linux feature that provides:

  • Isolation: Processes can't access each other's memory
  • Protection: The kernel prevents unauthorized memory access
  • Flexibility: Programs can use more memory than physically available

Virtual addresses are mapped to physical memory by the Memory Management Unit (MMU). A process thinks it has access to a large, contiguous memory space, but the kernel maps it to fragmented physical memory.

Memory Allocation and Paging

When a process needs memory:

  1. Allocation: The kernel allocates virtual memory
  2. Demand paging: Physical memory is allocated only when the process actually uses it
  3. Page faults: If the process accesses memory not in physical RAM, a page fault occurs
  4. Paging: The kernel loads the page from disk (swap) into RAM

This allows systems to run processes that collectively use more memory than physically available.
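Both the mapping granularity and a process's real footprint are visible from user space; a quick look (values vary by machine):

```shell
# The MMU maps memory in fixed-size pages (typically 4096 bytes on x86-64)
getconf PAGESIZE
# VmSize is the virtual address space; VmRSS is the subset actually
# resident in physical RAM thanks to demand paging
grep -E 'VmSize|VmRSS' /proc/$$/status
```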

Swap Space

Swap is disk space used as an extension of RAM. When physical memory is full:

  1. Least-used pages are moved to swap (disk)
  2. When needed again, they're moved back to RAM
  3. This allows the system to handle memory pressure

However, swap is much slower than RAM (disk I/O is ~1000x slower). Excessive swapping causes severe performance degradation. Modern systems try to minimize swapping through better memory management and cgroup limits.
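How eagerly the kernel swaps is tunable, and current swap usage is reported in /proc/meminfo:

```shell
# vm.swappiness controls how readily the kernel moves pages to swap
# (0 = avoid swapping; higher values swap more eagerly)
cat /proc/sys/vm/swappiness
# Total and free swap on this host, in kB
grep -E 'SwapTotal|SwapFree' /proc/meminfo
```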

Introduction to Namespaces

What are Namespaces?

Namespaces are a kernel feature that partitions system resources so that processes can have isolated views of the system. Instead of all processes seeing the same system resources, each namespace provides a separate view.

Think of namespaces like virtual worlds. Multiple processes can exist in different namespaces, each seeing a different version of the system. A process in one namespace can't see or interact with resources in another namespace.

Namespaces are the foundation of container isolation. Docker and Kubernetes use namespaces to create isolated environments for containers.

Types of Namespaces

Linux provides several types of namespaces, each isolating different system resources:

PID Namespace

Isolates process IDs. Each PID namespace has its own process tree with its own PID 1 (init process).

Use case: Containers see their own process tree, not the host's processes.
Example: A container's init process has PID 1 inside the container, but might be PID 12345 on the host.

Network Namespace

Isolates network resources: network interfaces, IP addresses, routing tables, firewall rules.

Use case: Each container has its own network stack.
Example: A container can have its own IP address, ports, and network configuration separate from the host.

Mount Namespace

Isolates the filesystem. Each namespace can have a different view of the filesystem hierarchy.

Use case: Containers have their own root filesystem.
Example: A container's / points to a container image, not the host's root filesystem.

IPC Namespace

Isolates Inter-Process Communication resources: message queues, shared memory, semaphores.

Use case: Processes in different IPC namespaces can't communicate via IPC.
Example: Two containers can't share memory or message queues.

UTS Namespace

Isolates hostname and domain name.

Use case: Each container can have its own hostname.
Example: A container can have hostname "web-server" while the host is "production-01".

User Namespace

Isolates user and group IDs. A process can be root (UID 0) inside a user namespace but a regular user on the host.

Use case: Containers can run as root inside but be unprivileged on the host.
Example: Container root (UID 0 in namespace) maps to UID 1000 on the host.

Cgroup Namespace

Isolates the cgroup hierarchy (we'll cover cgroups next).

Use case: Processes see a simplified cgroup view.
Example: A container sees its cgroup as / instead of /docker/container-id.
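Each namespace is identified by an inode number, visible as a symlink target under /proc/<pid>/ns/. Two processes are in the same namespace exactly when those targets match:

```shell
# Compare the UTS namespace of this shell with that of the readlink
# child it spawns: both print the same uts:[inode] value, because a
# child inherits its parent's namespaces
readlink "/proc/$$/ns/uts"
readlink /proc/self/ns/uts
```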

How Namespaces Enable Isolation

Namespaces work by providing separate views of system resources. When a process is created in a namespace:

  1. It inherits the namespace from its parent
  2. It can only see resources in its namespace
  3. It can't access resources in other namespaces
  4. The kernel enforces this isolation

This is how Docker containers can:

  • Run their own init process (PID 1)
  • Have their own network interfaces and IP addresses
  • Have their own filesystem
  • Have their own hostname
  • All on the same physical machine without interfering with each other

Introduction to Cgroups

What are Cgroups?

Cgroups (control groups) are a kernel feature that limits, prioritizes, and isolates resource usage of process groups. While namespaces provide isolation (you can't see other resources), cgroups provide resource limits (you can't use more than allowed).

Cgroups allow you to:

  • Limit CPU usage
  • Limit memory usage
  • Limit I/O bandwidth
  • Limit network bandwidth
  • Control device access
  • Prioritize resource allocation

Without cgroups, a single process could consume all CPU or memory, starving other processes. Cgroups prevent this.

Cgroups v1 vs. Cgroups v2

Cgroups v1 (legacy)

  • Multiple independent hierarchies
  • Each resource type (CPU, memory, I/O) has its own hierarchy
  • Complex to manage
  • Still widely used

Cgroups v2 (modern)

  • Single unified hierarchy
  • All resource types in one tree
  • Simpler to manage
  • Better performance
  • Becoming the standard (systemd uses it)

Most modern systems are transitioning to cgroups v2, but v1 is still common in production.

Resource Limits with Cgroups

CPU Limits

Control how much CPU time a process group can use:

plaintext
cpu.max = "50000 100000"  # 50% of one CPU core
cpu.weight = 100          # CPU scheduling weight (1-10000)

Memory Limits

Control how much memory a process group can use:

plaintext
memory.max = "512M"       # Hard limit: 512 MB
memory.high = "256M"      # Soft limit: triggers reclaim at 256 MB
memory.swap.max = "0"     # Disable swap for this cgroup

I/O Limits

Control disk I/O bandwidth:

plaintext
io.max = "8:0 rbps=10485760 wbps=10485760"  # 10 MB/s read and write

Device Access Control

Control which devices a process can access:

plaintext
devices.allow = "c 1:3 rw"   # Allow /dev/null (character device 1:3)
devices.deny = "b 8:* rwm"   # Deny all block devices

Note: the devices.allow and devices.deny files belong to the cgroups v1 device controller. Cgroups v2 has no device interface files; device access is controlled by attaching an eBPF program (BPF_CGROUP_DEVICE) to the cgroup instead.

Cgroup Hierarchy

Cgroups form a tree hierarchy. Each cgroup can have child cgroups, and resource limits are inherited and enforced at each level:

plaintext
/
├── system.slice
│   ├── systemd-logind.service
│   └── sshd.service
├── user.slice
│   └── user-1000.slice
│       └── session-1.scope
└── docker
    ├── container-1
    │   └── memory.max = 512M
    └── container-2
        └── memory.max = 1G

A process's actual limits are determined by all cgroups in its path from root to leaf.
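You can find where a process sits in this tree from /proc; on a cgroups v2 system the single hierarchy entry starts with "0::":

```shell
# Which cgroup does the current shell belong to?
cat /proc/self/cgroup
# On cgroups v2, resolve it to a path under the unified mount
cg=$(awk -F: '$1 == "0" {print $3}' /proc/self/cgroup)
if [ -n "$cg" ]; then
    echo "cgroup v2 path: /sys/fs/cgroup$cg"
fi
```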

Containers: Namespaces and Cgroups in Action

How Docker Uses Namespaces and Cgroups

Docker combines namespaces and cgroups to create isolated, resource-limited containers:

  1. Namespaces provide isolation: Each container has its own PID, network, mount, IPC, UTS, and user namespaces
  2. Cgroups provide limits: Each container is limited to specific CPU, memory, and I/O resources
  3. Union filesystems provide layered storage: Container images are built from layers

When you run docker run, Docker:

  1. Creates new namespaces for the container
  2. Sets up cgroup limits
  3. Mounts the container filesystem
  4. Starts the container process in the new namespaces

The container process thinks it's running on its own machine, but it's actually sharing the host kernel with other containers.

Container Isolation in Practice

Let's trace what happens when you run a container:

bash
docker run --name web --cpus 1 --memory 512m nginx

  1. PID Namespace: The nginx process gets PID 1 inside the container (but might be PID 5432 on the host)
  2. Network Namespace: The container gets its own network interface with its own IP address
  3. Mount Namespace: The container sees / as the nginx image root, not the host's root
  4. Cgroup limits: The container is limited to 1 CPU core and 512 MB of memory
  5. User Namespace: The container's root user maps to an unprivileged user on the host (if configured)

The container is completely isolated from other containers and the host, yet they all share the same kernel.

Resource Constraints in Containers

When you specify resource limits in Docker or Kubernetes, you're setting cgroup limits:

bash
# Docker: Limit to 2 CPUs and 1 GB memory
docker run --cpus 2 --memory 1g myapp

yaml
# Kubernetes: Set resource requests and limits
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "1000m"
    memory: "512Mi"

These limits are enforced by cgroups. If a container tries to exceed its memory limit, the kernel kills the process (OOMKill). If it tries to use more CPU than allocated, it's throttled.
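Throttling is observable: a cgroup's cpu.stat file counts how often and for how long its processes were throttled. A sketch that inspects the shell's own cgroup (paths assume cgroups v2):

```shell
# nr_throttled: number of periods in which the cgroup hit its quota
# throttled_usec: total time its processes spent throttled
cg=$(awk -F: '$1 == "0" {print $3}' /proc/self/cgroup)
if [ -r "/sys/fs/cgroup$cg/cpu.stat" ]; then
    grep -E 'nr_throttled|throttled_usec' "/sys/fs/cgroup$cg/cpu.stat" \
        || echo "no throttling counters (cpu controller not enabled here)"
else
    echo "cpu.stat not found (cgroups v1 host?)"
fi
```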

Practical Examples: Working with Namespaces and Cgroups

Viewing Namespaces

You can inspect namespaces on your system:

bash
# List all namespaces for a process (inspecting PID 1 requires root)
sudo ls -la /proc/1/ns/
 
# Show namespace IDs (the glob must expand inside the root shell)
sudo bash -c 'readlink /proc/1/ns/*'
 
# Compare PID 1's namespaces with your shell's
sudo bash -c "diff <(readlink /proc/1/ns/*) <(readlink /proc/$$/ns/*)"
 
# List all namespaces visible to you, with their processes (util-linux)
lsns

Creating Isolated Processes

You can create a process in a new namespace using unshare:

bash
# Create a new PID namespace; --mount-proc remounts /proc so that
# tools like ps see the new namespace's process table
sudo unshare --pid --fork --mount-proc /bin/bash
 
# Inside the new namespace, bash is PID 1
ps aux
 
# Create a new network namespace
sudo unshare --net /bin/bash
 
# Inside, you have isolated network interfaces
ip link show

Setting Resource Limits

You can set cgroup limits using systemd-run:

bash
# Run a process with a CPU limit (50% of one core)
sudo systemd-run --scope -p CPUQuota=50% stress-ng --cpu 1
 
# Run a process with a memory limit (256 MB); MemoryMax is the
# current name for the deprecated MemoryLimit setting
sudo systemd-run --scope -p MemoryMax=256M myapp
 
# Run with both limits
sudo systemd-run --scope -p CPUQuota=50% -p MemoryMax=512M myapp

Monitoring Cgroup Usage

Monitor resource usage of cgroups:

bash
# View cgroup v2 memory usage (the root cgroup has no memory.current,
# so read a child cgroup such as system.slice)
cat /sys/fs/cgroup/system.slice/memory.current
cat /sys/fs/cgroup/system.slice/memory.max
 
# View CPU usage statistics
cat /sys/fs/cgroup/system.slice/cpu.stat
 
# Monitor in real-time
watch -n 1 'cat /sys/fs/cgroup/system.slice/memory.current'
 
# For Docker containers
docker stats

Tip

The path to cgroup files differs between cgroups v1 and v2. Most modern systems use cgroups v2 at /sys/fs/cgroup/, while older systems use v1 at /sys/fs/cgroup/<resource>/.
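A quick way to check which version a host is running:

```shell
# cgroup.controllers exists only at the root of a v2 (unified) hierarchy
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
    echo "cgroups v2 (unified hierarchy)"
    cat /sys/fs/cgroup/cgroup.controllers
else
    echo "cgroups v1 (per-controller hierarchies)"
    ls /sys/fs/cgroup
fi
```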

Common Mistakes and Pitfalls

Misconfiguring Resource Limits

Mistake: Setting memory limits too low, causing OOMKill

yaml
# BAD: 128 MB is too low for most applications
resources:
  limits:
    memory: "128Mi"

Why it happens: Underestimating application memory needs or trying to pack too many containers

How to avoid it:

  • Monitor actual memory usage before setting limits
  • Set requests and limits appropriately
  • Use docker stats or Kubernetes metrics to understand usage

Not Understanding Namespace Inheritance

Mistake: Assuming a process in a container can access the host's network

Why it happens: Misunderstanding how network namespaces work

How to avoid it: Remember that containers have isolated network namespaces. To access the host network, use --network host in Docker or hostNetwork: true in Kubernetes.

Ignoring Memory Pressure

Mistake: Not accounting for memory pressure and swap usage

Why it happens: Assuming memory limits are hard stops (they're not—swap can extend them)

How to avoid it:

  • Disable swap in containers (memory.swap.max = 0)
  • Monitor swap usage
  • Set appropriate memory limits

Forgetting About Swap

Mistake: Allowing unlimited swap, causing performance degradation

Why it happens: Default system configuration allows swap

How to avoid it:

  • Explicitly disable swap for containers
  • Monitor swap usage on the host
  • Ensure sufficient physical memory for your workloads

Best Practices for Kernel-Level Resource Management

Production-Grade Configuration

Set appropriate resource requests and limits:

yaml
# Kubernetes example
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

Requests are what the scheduler uses for placement. Limits are hard caps enforced by cgroups.
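The translation from Kubernetes millicores to cgroup values is simple arithmetic; a sketch, assuming the default 100 ms cpu.max period:

```shell
# 1000m = one full CPU; the quota is millicores/1000 of the period
millicores=500
period_usec=100000
quota_usec=$(( millicores * period_usec / 1000 ))
echo "cpu.max = \"$quota_usec $period_usec\""   # 500m -> "50000 100000"
```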

Disable swap for predictable performance:

bash
# In Docker: set --memory-swap equal to --memory so the container gets
# no additional swap (a value of 0 means "unset", not "no swap")
docker run --memory 512m --memory-swap 512m myapp
 
# Kubernetes: the kubelet refuses to start with swap enabled by default
# (failSwapOn), so disable swap on each node
sudo swapoff -a

Use resource quotas in Kubernetes:

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"

Monitoring and Observability

Monitor cgroup metrics:

  • CPU usage and throttling
  • Memory usage and OOMKill events
  • I/O bandwidth and latency
  • Network bandwidth

Use tools like:

  • docker stats: Real-time container metrics
  • kubectl top: Kubernetes resource usage
  • Prometheus: Metrics collection and alerting
  • cAdvisor: Container metrics collection

Set up alerts for:

  • Memory approaching limits
  • CPU throttling
  • OOMKill events
  • Swap usage

Security Considerations

Use user namespaces: Map container root to unprivileged user on host

bash
# User-namespace remapping is a daemon-wide setting, not a docker run
# flag: enable it in /etc/docker/daemon.json and restart the daemon
echo '{ "userns-remap": "default" }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker

Drop unneeded privileges: Disable privilege escalation and drop capabilities the workload doesn't require

yaml
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL

Use read-only filesystems: Prevent container from modifying its filesystem

yaml
securityContext:
  readOnlyRootFilesystem: true

When NOT to Manually Configure Cgroups

Use Container Orchestration Instead

Don't manually configure cgroups. Use Docker or Kubernetes instead:

bash
# DON'T do this manually
echo "512M" > /sys/fs/cgroup/memory/myapp/memory.limit_in_bytes
 
# DO this instead
docker run --memory 512m myapp

Container orchestration tools handle cgroup configuration for you, with better abstractions and error handling.

Kubernetes Handles This for You

In Kubernetes, you specify resource requests and limits, and Kubernetes manages cgroups:

yaml
# Kubernetes handles the cgroup configuration
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"

You don't need to know about cgroups to use Kubernetes effectively. The orchestrator abstracts away the complexity.

Key Takeaways

  • Namespaces provide isolation: Each container has isolated views of PID, network, filesystem, IPC, hostname, and users
  • Cgroups provide resource limits: CPU, memory, I/O, and device access are controlled and limited
  • Containers combine both: Docker and Kubernetes use namespaces for isolation and cgroups for resource management
  • The kernel enforces isolation: Processes in different namespaces can't see or interfere with each other
  • Understanding these concepts is essential: For DevOps, cloud engineering, and infrastructure work
  • Don't manually configure cgroups: Use Docker or Kubernetes instead—they handle the complexity

Next Steps

  1. Explore namespaces: Use ls /proc/1/ns/ to see namespaces on your system
  2. Experiment with containers: Run Docker containers and inspect their namespaces
  3. Monitor cgroups: Use docker stats to see resource usage
  4. Read the kernel documentation: /usr/share/doc/linux-doc/ or kernel.org
  5. Continue the series: Move to Episode 3: Permissions, Users & Groups to understand user isolation

Understanding namespaces and cgroups is the key to mastering containerization. These concepts apply whether you're using Docker, Kubernetes, or any other container platform.


Ready for the next episode? Continue with Episode 3: Permissions, Users & Groups to master file permissions and user management, which are crucial for container security.

