cgroups v2 & systemd-run - Resource Control and Sandboxing for Raspberry Pi

Introduction

Control Groups (cgroups) v2 is a unified Linux kernel feature that provides hierarchical resource management and isolation for processes. Combined with systemd-run, it enables powerful process control, resource limiting, and sandboxing without requiring full containerization.

This technology transforms how you manage system resources:

  • Resource Limiting: Set precise CPU, memory, and I/O limits per process
  • Priority Control: Ensure critical services get resources they need
  • Resource Isolation: Prevent runaway processes from affecting the system
  • Accounting: Track exact resource consumption per service
  • Dynamic Control: Adjust limits on running processes without restart
  • Sandboxing: Restrict process capabilities and filesystem access

Why cgroups v2 matters for Raspberry Pi:

  • Limited Resources: Make the most of limited RAM and CPU
  • Multiple Services: Run many services without interference
  • Development: Test resource-constrained scenarios
  • Education: Understand container technology foundations
  • Stability: Prevent memory/CPU exhaustion crashes
  • Fair Sharing: Ensure responsive system under load

Common use cases:

  • Build Jobs: Limit compilation processes to prevent system freezes
  • Media Encoding: Control CPU/memory for ffmpeg/handbrake
  • Database Services: Guarantee minimum resources for critical services
  • Development Environments: Isolate test environments
  • Backup Operations: Limit I/O impact during backups
  • Web Services: Prevent DoS from resource exhaustion
  • Batch Processing: Run background jobs with limited resources
  • Container Understanding: Learn Docker/Podman internals

This comprehensive guide covers:

  • cgroups v2 Fundamentals: Unified hierarchy and controllers
  • systemd Integration: Native cgroup management with systemd
  • CPU Control: Shares, quotas, and affinity management
  • Memory Limits: Hard limits, soft limits, and OOM behavior
  • I/O Control: Bandwidth limiting and priority scheduling
  • systemd-run: Ad-hoc service execution with resource limits
  • Sandboxing: Filesystem, network, and capability restrictions
  • Monitoring: Real-time resource usage tracking
  • Production Examples: Database, web server, build systems
  • Performance Tuning: Optimal configurations for Raspberry Pi

Perfect for:

  • System Administrators: Multi-tenant system management
  • DevOps Engineers: Understanding container resource control
  • Developers: Resource-constrained testing environments
  • Students: Learning modern Linux resource management
  • Home Lab Enthusiasts: Running multiple services efficiently
  • Embedded Engineers: Deterministic resource allocation
  • Performance Engineers: Controlling resource contention

Understanding cgroups v2

cgroups v1 vs v2

cgroups v1 (Legacy - Multiple Hierarchies)
├── cpu/                    (CPU controller)
│   ├── system.slice
│   └── user.slice
├── memory/                 (Memory controller)
│   ├── system.slice
│   └── user.slice
├── blkio/                  (Block I/O controller)
│   └── ...
└── ...                     (12+ separate hierarchies)

Problems:
- Controllers in different hierarchies
- Inconsistent behavior
- Complex management
- Resource conflicts

────────────────────────────────────────────────────

cgroups v2 (Unified Hierarchy)
/sys/fs/cgroup/
├── system.slice/
│   ├── cpu.max            (All controllers)
│   ├── memory.max         (in same)
│   ├── io.max             (hierarchy)
│   ├── ssh.service/
│   └── nginx.service/
├── user.slice/
│   └── user-1000.slice/
└── machine.slice/          (Virtual machines/containers)

Benefits:
✓ Single unified hierarchy
✓ Consistent controller interface
✓ Better resource coordination
✓ Simplified management

cgroups v2 Architecture

┌─────────────────────────────────────────────────────────┐
│                    Root cgroup (/)                       │
│                                                          │
│  Controllers: cpu, memory, io, pids                     │
│  All system processes start here                        │
└────────────────────┬────────────────────────────────────┘
        ┌────────────┼────────────┐
        │            │            │
        ▼            ▼            ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│system.slice  │ │user.slice    │ │machine.slice │
│              │ │              │ │              │
│ System       │ │ User         │ │ VMs/         │
│ services     │ │ sessions     │ │ Containers   │
└──────┬───────┘ └──────┬───────┘ └──────────────┘
       │                │
       │                ├─ user-1000.slice
       │                │  └─ session-1.scope
       │                │     ├─ app-1
       │                │     └─ app-2
       │                │
       ├─ nginx.service
       ├─ postgresql.service
       └─ myapp.service
          ├─ cpu.max: 50000 100000  (50% CPU)
          ├─ memory.max: 512M
          ├─ io.weight: 100
          └─ pids.max: 100

Available Controllers

Controller   Purpose                    Key Files
cpu          CPU time distribution      cpu.max, cpu.weight
memory       RAM usage limits           memory.max, memory.high
io           Block device I/O           io.max, io.weight
pids         Process/thread limits      pids.max
cpuset       CPU/memory node binding    cpuset.cpus, cpuset.mems
rdma         RDMA/IB resources          (advanced networking)
hugetlb      Huge page allocation       (advanced memory)

Checking cgroups v2 Support

Verify cgroups v2 is Enabled

# Check if cgroups v2 is mounted
mount | grep cgroup
# Should show: cgroup2 on /sys/fs/cgroup type cgroup2

# Check cgroup version
stat -fc %T /sys/fs/cgroup/
# Output: cgroup2fs (v2) or tmpfs (v1)

# List available controllers
cat /sys/fs/cgroup/cgroup.controllers
# Example: cpuset cpu io memory hugetlb pids rdma

# Check systemd cgroup driver
systemctl --version | grep "systemd"
# Raspberry Pi OS uses systemd with cgroups v2

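# Check whether the unified hierarchy was forced on the kernel command line
grep -o "systemd.unified_cgroup_hierarchy=[0-9]" /proc/cmdline
# Prints systemd.unified_cgroup_hierarchy=1 only if the option was set explicitly;
# no output is normal when cgroups v2 is already the default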
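
The controller check above can also be scripted. The following is a minimal Python sketch, assuming you want cpu, memory, io, and pids available; the REQUIRED set and the function name are illustrative choices, not something systemd mandates.

```python
# Sketch: compare the controllers the kernel exposes with the ones a setup needs.
# REQUIRED is an illustrative assumption, not a list mandated by systemd.
REQUIRED = {"cpu", "memory", "io", "pids"}


def missing_controllers(controllers_text, required=REQUIRED):
    """Return the required controllers absent from a cgroup.controllers string."""
    available = set(controllers_text.split())
    return set(required) - available


if __name__ == "__main__":
    # On a real system, read this text from /sys/fs/cgroup/cgroup.controllers.
    sample = "cpuset cpu io memory hugetlb pids rdma"
    print(sorted(missing_controllers(sample)))
    print(sorted(missing_controllers("cpuset cpu pids")))
```

An empty result means every controller in the assumed set is present; anything else names the controllers to enable before relying on the matching limits.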

Enable cgroups v2 (if needed)

Most recent Raspberry Pi OS versions use cgroups v2 by default. If not:

# Edit boot configuration (older releases keep this file at /boot/cmdline.txt)
sudo nano /boot/firmware/cmdline.txt

# Append to the existing single line (do not start a new line):
systemd.unified_cgroup_hierarchy=1 cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1

# Reboot
sudo reboot

# Verify after reboot
stat -fc %T /sys/fs/cgroup/

CPU Control

Understanding CPU Limits

# CPU controller parameters:

# cpu.weight (1-10000, default 100)
# - Relative CPU share (like nice values)
# - Higher weight = more CPU time when contending

# cpu.max (quota period)
# - Absolute CPU limit
# - Format: "max_microseconds period_microseconds"
# - Example: "50000 100000" = 50% of one CPU core

# cpu.stat
# - Statistics: usage_usec, user_usec, system_usec
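
The cpu.max arithmetic described above is easy to get backwards, so here is a small Python sketch of the conversion in both directions. The function names are hypothetical helpers for illustration; only the "quota period" format comes from the text above.

```python
from typing import Optional


def cpu_max_to_cores(cpu_max: str) -> Optional[float]:
    """Convert a cpu.max value ("quota period") to a core count.

    Returns None when the quota is "max", meaning no limit is set.
    """
    quota, period = cpu_max.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def cores_to_cpu_max(cores: float, period_us: int = 100_000) -> str:
    """Build a cpu.max string for a core count, assuming a 100 ms period."""
    quota_us = round(cores * period_us)
    return f"{quota_us} {period_us}"


if __name__ == "__main__":
    print(cpu_max_to_cores("50000 100000"))   # half of one core
    print(cpu_max_to_cores("max 100000"))     # unlimited
    print(cores_to_cpu_max(1.5))              # one and a half cores
```

A quota larger than the period simply means more than one core, which is why a 150% limit is a valid setting on a multi-core Pi.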

CPU Limiting with systemd-run

# Run a process with 50% CPU limit
systemd-run --unit=my-task \
    --property=CPUQuota=50% \
    stress --cpu 1

# Check status
systemctl status my-task

# View CPU usage
systemd-cgtop

# Stop
systemctl stop my-task

# Multiple CPUs - limit to 1.5 cores
systemd-run --unit=build-job \
    --property=CPUQuota=150% \
    make -j4

# CPU affinity - pin to specific cores
systemd-run --unit=pinned-task \
    --property=AllowedCPUs=0,1 \
    ./my-program

# Relative CPU weight (priority)
systemd-run --unit=low-priority \
    --property=CPUWeight=50 \
    ./background-task

systemd-run --unit=high-priority \
    --property=CPUWeight=200 \
    ./important-task
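
When the same limits are applied from scripts, it can help to assemble the systemd-run command line programmatically. This is a minimal Python sketch; build_systemd_run is a hypothetical helper, and it only uses the --unit and --property options already shown above.

```python
import shlex


def build_systemd_run(unit, properties, command):
    """Return an argv list for systemd-run with one --property per setting."""
    argv = ["systemd-run", f"--unit={unit}"]
    for key, value in properties.items():
        argv.append(f"--property={key}={value}")
    argv.extend(command)
    return argv


if __name__ == "__main__":
    argv = build_systemd_run(
        "build-job",
        {"CPUQuota": "150%", "CPUWeight": 50},
        ["make", "-j4"],
    )
    # Print the shell-quoted form for inspection; pass argv itself to
    # subprocess.run() to avoid any quoting problems.
    print(shlex.join(argv))
```

Keeping the command as a list rather than a string avoids shell quoting issues when unit names or arguments contain spaces.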

Persistent CPU Limits for Services

# Create service with CPU limits
sudo systemctl edit --force --full my-service.service
[Unit]
Description=CPU-Limited Service

[Service]
Type=simple
ExecStart=/usr/local/bin/my-app

# CPU Limits
# Note: systemd does not support trailing comments, so each note sits on its own line
# Max 75% of one core
CPUQuota=75%
# Default relative weight
CPUWeight=100
# Pin to cores 0 and 1
CPUAffinity=0 1

# Accounting
CPUAccounting=yes

[Install]
WantedBy=multi-user.target
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable --now my-service

# Monitor CPU usage
systemd-cgtop /system.slice/my-service.service

# View detailed stats
systemctl show my-service | grep CPU

Memory Control

Memory Limit Types

# memory.max
# - Hard limit: Process killed (OOM) if exceeded
# - Format: bytes or with unit (512M, 2G)

# memory.high
# - Soft limit: Throttling if exceeded
# - System tries to reclaim memory
# - More graceful than memory.max

# memory.low
# - Best-effort protection
# - Memory not reclaimed unless necessary

# memory.min
# - Hard protection
# - Memory never reclaimed

# memory.swap.max
# - Swap space limit
# - Default: unlimited
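
The four memory knobs above are usually set so that protections sit below limits. The Python sketch below parses systemd-style sizes (K, M, G, T as powers of 1024) and flags settings that are out of order. The ordering check is a sanity convention assumed for this example, not a rule the kernel enforces, and both function names are hypothetical.

```python
import math

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Parse a size such as "512M" or "infinity" into bytes."""
    text = text.strip()
    if text == "infinity":
        return math.inf
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def check_memory_settings(settings):
    """Return a list of problems where a lower knob exceeds a higher one."""
    order = ["MemoryMin", "MemoryLow", "MemoryHigh", "MemoryMax"]
    present = [(key, parse_size(settings[key])) for key in order if key in settings]
    problems = []
    for (low_key, low), (high_key, high) in zip(present, present[1:]):
        if low > high:
            problems.append(f"{low_key} exceeds {high_key}")
    return problems


if __name__ == "__main__":
    print(parse_size("512M"))
    print(check_memory_settings({"MemoryHigh": "400M", "MemoryMax": "512M"}))
    print(check_memory_settings({"MemoryMin": "1G", "MemoryMax": "512M"}))
```

An empty list means the configured values are consistent with the assumed ordering.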

Memory Limiting with systemd-run

# Hard memory limit (512MB)
systemd-run --unit=limited-app \
    --property=MemoryMax=512M \
    python3 memory-hungry-app.py

# Soft limit (throttle and reclaim above 400MB, hard limit at 512MB)
systemd-run --unit=throttled-app \
    --property=MemoryHigh=400M \
    --property=MemoryMax=512M \
    ./app

# Memory protection (guarantee 256MB minimum)
systemd-run --unit=protected-db \
    --property=MemoryMin=256M \
    --property=MemoryMax=1G \
    postgres

# Disable swap for this process
systemd-run --unit=no-swap-app \
    --property=MemorySwapMax=0 \
    ./realtime-app

# Monitor memory usage
systemd-cgtop /system.slice/limited-app.service

Handling OOM (Out Of Memory)

# Create service that handles OOM gracefully
sudo systemctl edit --force --full oom-test.service
[Unit]
Description=OOM Test Service

[Service]
Type=simple
ExecStart=/usr/bin/stress --vm 1 --vm-bytes 1G

# Memory limits
MemoryMax=512M
MemoryAccounting=yes

# OOM policy (options: continue, stop, kill)
OOMPolicy=stop
# Make the service more likely to be killed (default 0, range -1000 to 1000)
OOMScoreAdjust=100

# Restart on OOM
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
# Start and watch
sudo systemctl start oom-test
journalctl -u oom-test -f

# Check OOM kills
dmesg | grep -i oom

I/O Control

I/O Bandwidth Limiting

# Find device major:minor numbers
lsblk
# Example output:
# NAME        MAJ:MIN
# mmcblk0     179:0
# ├─mmcblk0p1 179:1
# └─mmcblk0p2 179:2

# Limit read bandwidth to 10 MB/s
systemd-run --unit=io-limited \
    --property=IOReadBandwidthMax="/dev/mmcblk0 10M" \
    dd if=/dev/mmcblk0 of=/dev/null bs=1M

# Limit write bandwidth to 5 MB/s
systemd-run --unit=backup-job \
    --property=IOWriteBandwidthMax="/dev/mmcblk0 5M" \
    rsync -av /source/ /backup/

# Combined read/write limits
systemd-run --unit=controlled-io \
    --property=IOReadBandwidthMax="/dev/mmcblk0 20M" \
    --property=IOWriteBandwidthMax="/dev/mmcblk0 10M" \
    ./disk-intensive-app

# I/O weight (relative priority, 1-10000)
systemd-run --unit=background-backup \
    --property=IOWeight=50 \
    rsync -av /data/ /backup/

systemd-run --unit=important-db \
    --property=IOWeight=500 \
    postgres
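
Bandwidth limits set this way end up in the cgroup's io.max file, keyed by the MAJ:MIN numbers that lsblk reports. The Python sketch below parses one io.max line into a dictionary; parse_io_max is a hypothetical helper and the sample line is illustrative, not captured from a real system.

```python
def parse_io_max(line):
    """Parse one io.max line into (device, limits).

    Limits map rbps/wbps/riops/wiops to an int, or None for "max" (no limit).
    """
    device, *pairs = line.split()
    limits = {}
    for pair in pairs:
        key, value = pair.split("=", 1)
        limits[key] = None if value == "max" else int(value)
    return device, limits


if __name__ == "__main__":
    # Illustrative line: a read limit on device 179:0, everything else unlimited.
    sample = "179:0 rbps=10000000 wbps=max riops=max wiops=max"
    device, limits = parse_io_max(sample)
    print(device, limits)
```

Reading the file back after starting a unit is a quick way to confirm that the limit landed on the device you intended.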

I/O Latency Control

# Set I/O latency target (a time span such as 10ms)
systemd-run --unit=low-latency \
    --property=IODeviceLatencyTargetSec="/dev/mmcblk0 10ms" \
    ./latency-sensitive-app

# Monitor I/O stats
systemd-cgtop -d 1
# Shows: IO Read, IO Write per cgroup

Persistent I/O Limits

# /etc/systemd/system/backup.service
[Unit]
Description=Nightly Backup Service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh

# I/O Limits (reduce impact on system)
# Note: systemd does not support trailing comments, so each note sits on its own line
IOAccounting=yes
# Low priority
IOWeight=10
# Limit reads
IOReadBandwidthMax=/dev/mmcblk0 5M
# Limit writes
IOWriteBandwidthMax=/dev/mmcblk0 10M

[Install]
WantedBy=multi-user.target

Process Limits

PID Limits

# Limit number of processes/threads
systemd-run --unit=fork-bomb-protection \
    --property=TasksMax=50 \
    ./my-app

# Unlimited tasks
systemd-run --unit=many-threads \
    --property=TasksMax=infinity \
    ./multithreaded-app

# Check current task count (use the unit name passed to --unit)
systemctl status fork-bomb-protection | grep Tasks
# Tasks: 23 (limit: 50)
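
The same numbers are available as plain text in the cgroup's pids.current and pids.max files. This small Python sketch computes the remaining headroom from those two values; tasks_headroom is a hypothetical helper name.

```python
def tasks_headroom(current_text, max_text):
    """Return how many more tasks may be created, or None if pids.max is "max"."""
    current = int(current_text)
    if max_text.strip() == "max":
        return None
    return int(max_text) - current


if __name__ == "__main__":
    # On a real system, read these strings from pids.current and pids.max.
    print(tasks_headroom("23", "50"))
    print(tasks_headroom("23", "max"))
```

A headroom that keeps shrinking toward zero is a hint that the TasksMax value is too tight for the workload, or that the process is leaking threads.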

Sandboxing with systemd

Filesystem Sandboxing

# Read-only root filesystem
systemd-run --unit=readonly-fs \
    --property=ProtectSystem=strict \
    --property=ReadWritePaths=/var/lib/myapp \
    ./my-app

# Isolate /tmp
systemd-run --unit=private-tmp \
    --property=PrivateTmp=yes \
    ./app

# Hide /home directories
systemd-run --unit=no-home \
    --property=ProtectHome=yes \
    ./service

# Restrict path visibility
systemd-run --unit=restricted \
    --property=InaccessiblePaths=/root \
    --property=InaccessiblePaths=/boot \
    ./app

Network Sandboxing

# Disable network access
systemd-run --unit=offline-app \
    --property=PrivateNetwork=yes \
    ./compute-task

# Isolate network namespace (private loopback)
systemd-run --unit=isolated-net \
    --property=PrivateNetwork=yes \
    python3 -m http.server 8080
# Accessible only within this network namespace

Capability Restrictions

# Drop all capabilities
systemd-run --unit=no-caps \
    --property=CapabilityBoundingSet= \
    ./app

# Grant specific capabilities
systemd-run --unit=bind-low-port \
    --property=AmbientCapabilities=CAP_NET_BIND_SERVICE \
    --property=User=nobody \
    ./webserver --port 80

# Common capabilities:
# CAP_NET_BIND_SERVICE - Bind to ports < 1024
# CAP_NET_RAW - Raw sockets (ping)
# CAP_SYS_ADMIN - Mount, etc. (dangerous!)

Complete Sandbox Example

# /etc/systemd/system/sandboxed-web.service
[Unit]
Description=Sandboxed Web Application

[Service]
Type=simple
ExecStart=/usr/local/bin/webapp
User=webapp
Group=webapp

# Resource Limits
CPUQuota=50%
MemoryMax=512M
TasksMax=100
IOWeight=100

# Note: systemd does not support trailing comments, so each note sits on its own line

# Filesystem Sandboxing
# strict: the whole file system is read-only except /dev, /proc and /sys
ProtectSystem=strict
# No access to /home
ProtectHome=yes
# Private /tmp
PrivateTmp=yes
# Only writable location
ReadWritePaths=/var/lib/webapp
# Can't gain privileges
NoNewPrivileges=yes

# Network Sandboxing: block everything, then allow localhost and the local network
IPAddressDeny=any
IPAddressAllow=localhost
IPAddressAllow=192.168.1.0/24

# Kernel/System Protection
# /proc/sys, /sys read-only
ProtectKernelTunables=yes
# Can't load kernel modules
ProtectKernelModules=yes
# cgroup hierarchy read-only
ProtectControlGroups=yes
# No access to kernel logs
ProtectKernelLogs=yes

# Capabilities
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE

# System Calls: allow the common service set, deny privileged calls with EPERM
SystemCallFilter=@system-service
SystemCallFilter=~@privileged
SystemCallErrorNumber=EPERM

# Misc
# No realtime scheduling
RestrictRealtime=yes
# Can't create namespaces
RestrictNamespaces=yes
# Prevent execution domain changes
LockPersonality=yes

[Install]
WantedBy=multi-user.target
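
One pitfall with long unit files like this: systemd only treats a line as a comment when it starts with # or ;, so a note placed after a value becomes part of that value. The Python sketch below is a rough heuristic check for that mistake. find_trailing_comments is a hypothetical helper, and it may produce false positives for commands that legitimately contain " #".

```python
def find_trailing_comments(unit_text):
    """Return (line_number, line) pairs where a value is followed by ' #'."""
    hits = []
    for number, line in enumerate(unit_text.splitlines(), 1):
        stripped = line.strip()
        # Skip blanks, full-line comments, and section headers.
        if not stripped or stripped[0] in "#;[":
            continue
        _, sep, value = stripped.partition("=")
        if sep and (" #" in value or "\t#" in value):
            hits.append((number, stripped))
    return hits


if __name__ == "__main__":
    sample = "[Service]\n# ok comment\nCPUQuota=50%\nMemoryMax=512M   # hard limit\n"
    for number, line in find_trailing_comments(sample):
        print(f"line {number}: {line}")
```

Running `systemd-analyze verify` on the unit file is the authoritative check; this sketch only catches the one pattern described here.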

Monitoring and Inspection

systemd-cgtop - Real-Time Resource Monitor

# Interactive resource monitor
systemd-cgtop

# Output example:
# Control Group                    Tasks   %CPU   Memory  Input/s Output/s
# /                                  234   12.5     1.2G        -        -
# /system.slice                       89    8.2   512.0M   1.2M/s   500K/s
# /system.slice/nginx.service         12    2.1    64.0M        -   100K/s
# /system.slice/postgres.service      25    5.8   256.0M   800K/s   200K/s
# /user.slice                         98    4.1   512.0M        -        -

# Update every second
systemd-cgtop -d 1

# Show specific cgroup
systemd-cgtop /system.slice/nginx.service

# Sort by memory
systemd-cgtop --order=memory

# Sort by CPU
systemd-cgtop --order=cpu

Inspecting cgroup Settings

# Show all properties of a service
systemctl show nginx.service

# Filter for resource properties
systemctl show nginx.service | grep -E "(CPU|Memory|IO|Tasks)"

# View cgroup files directly
ls /sys/fs/cgroup/system.slice/nginx.service/
cat /sys/fs/cgroup/system.slice/nginx.service/memory.current
cat /sys/fs/cgroup/system.slice/nginx.service/cpu.stat

# View process cgroup membership
cat /proc/PID/cgroup
# Example: 0::/system.slice/nginx.service
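
To go from a process to its cgroup directory in a script, the /proc/PID/cgroup entry can be mapped onto the /sys/fs/cgroup mount. A minimal Python sketch, assuming the default mount point; cgroup_dir is a hypothetical helper.

```python
from pathlib import PurePosixPath


def cgroup_dir(proc_cgroup_text, mount="/sys/fs/cgroup"):
    """Return the cgroup v2 directory for a /proc/PID/cgroup listing.

    The v2 entry has hierarchy ID 0 and an empty controller field ("0::").
    Returns None when only v1 entries are present.
    """
    for line in proc_cgroup_text.splitlines():
        hierarchy, controllers, path = line.split(":", 2)
        if hierarchy == "0" and controllers == "":
            return str(PurePosixPath(mount) / path.lstrip("/"))
    return None


if __name__ == "__main__":
    print(cgroup_dir("0::/system.slice/nginx.service\n"))
```

The returned directory is where files such as memory.current and cpu.stat for that process live.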

Resource Accounting Statistics

# CPU usage statistics
systemctl show nginx.service | grep CPUUsage
# CPUUsageNSec=1234567890000 (nanoseconds)

# Memory usage
systemctl show nginx.service | grep Memory
# MemoryCurrent=67108864 (bytes)
# MemoryPeak=134217728

# I/O statistics
systemctl show nginx.service | grep IO
# IOReadBytes=1048576000
# IOWriteBytes=524288000

# Task count
systemctl show nginx.service | grep Tasks
# TasksCurrent=12
# TasksMax=100
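
The KEY=VALUE output of systemctl show is straightforward to post-process. The Python sketch below turns it into a dictionary and converts the raw counters into friendlier units; parse_show and as_int are hypothetical helpers, and the sample text reuses the example values above.

```python
def parse_show(output):
    """Parse `systemctl show` output into a dict of property name to string."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def as_int(props, key):
    """Return a property as int, or None for values like "infinity" or "[not set]"."""
    value = props.get(key, "")
    return int(value) if value.isdigit() else None


if __name__ == "__main__":
    sample = "CPUUsageNSec=1234567890000\nMemoryCurrent=67108864\nTasksMax=infinity\n"
    props = parse_show(sample)
    print(as_int(props, "CPUUsageNSec") / 1e9, "seconds of CPU time")
    print(as_int(props, "MemoryCurrent") / 1024**2, "MiB in use")
    print(as_int(props, "TasksMax"))
```

Non-numeric values are returned as None so that callers can distinguish "no limit" or "accounting off" from a real zero.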

Practical Examples

Build System with Resource Limits

#!/bin/bash
# build-with-limits.sh - Compile project with resource constraints

PROJECT_DIR="/home/pi/myproject"
BUILD_DIR="$PROJECT_DIR/build"

echo "Starting build with resource limits..."

systemd-run \
    --unit=project-build-$(date +%s) \
    --collect \
    --wait \
    --working-directory="$PROJECT_DIR" \
    --property=CPUQuota=75% \
    --property=MemoryMax=1G \
    --property=IOWeight=50 \
    --property=Nice=10 \
    bash -c "
        set -e                  # stop at the first failing step
        mkdir -p build
        cd build
        cmake ..
        make -j2
    "

BUILD_STATUS=$?

if [ $BUILD_STATUS -eq 0 ]; then
    echo "Build completed successfully!"
else
    echo "Build failed with status $BUILD_STATUS"
fi

exit $BUILD_STATUS

Media Encoding Pipeline

#!/bin/bash
# encode-videos.sh - Batch video encoding with resource limits

INPUT_DIR="/home/pi/videos/input"
OUTPUT_DIR="/home/pi/videos/output"

mkdir -p "$OUTPUT_DIR"

# Process each video file
find "$INPUT_DIR" -type f -name "*.mp4" | while read -r VIDEO; do
    BASENAME=$(basename "$VIDEO" .mp4)
    OUTPUT_FILE="$OUTPUT_DIR/${BASENAME}_compressed.mp4"

    echo "Encoding: $BASENAME"

    # Run ffmpeg with limits (prevent system freeze)
    systemd-run \
        --unit="encode-$BASENAME" \
        --collect \
        --wait \
        --property=CPUQuota=60% \
        --property=MemoryMax=512M \
        --property=IOWeight=30 \
        --property=Nice=15 \
        ffmpeg -i "$VIDEO" \
            -c:v libx264 \
            -crf 28 \
            -preset medium \
            -c:a aac \
            -b:a 128k \
            "$OUTPUT_FILE"

    if [ $? -eq 0 ]; then
        echo "✓ Completed: $BASENAME"
    else
        echo "✗ Failed: $BASENAME"
    fi
done

echo "All encoding jobs completed"

Database Service with Guaranteed Resources

# /etc/systemd/system/postgresql-custom.service
[Unit]
Description=PostgreSQL with Resource Guarantees
After=network.target

[Service]
Type=notify
User=postgres
Group=postgres

ExecStart=/usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file=/etc/postgresql/13/main/postgresql.conf

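# Note: systemd does not support trailing comments, so each note sits on its own line

# CPU
CPUAccounting=yes
# High priority (default 100)
CPUWeight=200
# Up to 1.5 cores
CPUQuota=150%

# Memory
MemoryAccounting=yes
# Guaranteed minimum
MemoryMin=512M
# Soft limit (throttle)
MemoryHigh=1G
# Hard limit (OOM)
MemoryMax=2G
# No swap
MemorySwapMax=0

# I/O
IOAccounting=yes
# High I/O priority
IOWeight=500
IOReadBandwidthMax=/dev/mmcblk0 50M
IOWriteBandwidthMax=/dev/mmcblk0 30M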
IOReadBandwidthMax=/dev/mmcblk0 50M
IOWriteBandwidthMax=/dev/mmcblk0 30M

# Processes
TasksMax=200

# Restart policy
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Containerized Development Environment

#!/bin/bash
# dev-env.sh - Isolated development environment

DEV_ROOT="/tmp/dev-env-$$"
mkdir -p "$DEV_ROOT"/{bin,lib,tmp,workspace}

# Copy essential binaries
cp /bin/bash "$DEV_ROOT/bin/"
cp /bin/ls "$DEV_ROOT/bin/"
cp /usr/bin/python3 "$DEV_ROOT/bin/"

# Copy shared libraries
for BIN in "$DEV_ROOT/bin/"*; do
    ldd "$BIN" 2>/dev/null | grep -o '/lib[^ ]*' | while read -r LIB; do
        mkdir -p "$DEV_ROOT$(dirname "$LIB")"
        cp "$LIB" "$DEV_ROOT$LIB" 2>/dev/null
    done
done

echo "Starting isolated development environment..."

# With RootDirectory= set, the bind-mount destination and the working directory
# are paths inside the new root, so they are given as /workspace
systemd-run \
    --unit="dev-env-$$" \
    --pty \
    --collect \
    --working-directory=/workspace \
    --property=CPUQuota=100% \
    --property=MemoryMax=512M \
    --property=PrivateNetwork=yes \
    --property=PrivateTmp=yes \
    --property=ProtectHome=yes \
    --property=RootDirectory="$DEV_ROOT" \
    --property=BindReadOnlyPaths="/home/pi/projects:/workspace" \
    /bin/bash

# Cleanup on exit
rm -rf "$DEV_ROOT"

Batch Processing with Dynamic Scaling

#!/usr/bin/env python3
# batch-processor.py - Process jobs with adaptive resource limits

import subprocess
import sys
import time

def get_cpu_usage():
    """Get current system CPU usage percentage"""
    with open('/proc/stat', 'r') as f:
        line = f.readline()
        fields = [int(x) for x in line.split()[1:]]
        idle = fields[3]
        total = sum(fields)

    time.sleep(0.1)

    with open('/proc/stat', 'r') as f:
        line = f.readline()
        fields = [int(x) for x in line.split()[1:]]
        idle2 = fields[3]
        total2 = sum(fields)

    idle_delta = idle2 - idle
    total_delta = total2 - total
    usage = 100.0 * (1.0 - idle_delta / total_delta)

    return usage

def process_job(job_id, job_command, cpu_quota, memory_limit):
    """Process a single job with resource limits"""
    print(f"Starting job {job_id} (CPU: {cpu_quota}%, Memory: {memory_limit})")

    cmd = [
        'systemd-run',
        f'--unit=batch-job-{job_id}',
        '--collect',
        '--wait',
        f'--property=CPUQuota={cpu_quota}%',
        f'--property=MemoryMax={memory_limit}',
        '--property=IOWeight=30',
        'bash', '-c', job_command
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode == 0:
        print(f"✓ Job {job_id} completed")
    else:
        print(f"✗ Job {job_id} failed: {result.stderr}")

    return result.returncode

def main():
    jobs = [
        "sleep 5 && echo 'Job 1 done'",
        "stress --cpu 1 --timeout 10",
        "dd if=/dev/zero of=/tmp/test bs=1M count=100",
        "sleep 3 && echo 'Job 4 done'",
    ]

    for idx, job in enumerate(jobs, 1):
        # Adaptive resource allocation based on system load
        cpu_usage = get_cpu_usage()

        if cpu_usage < 30:
            cpu_quota = 100  # System idle, use full core
            memory_limit = "512M"
        elif cpu_usage < 70:
            cpu_quota = 50   # Moderate load
            memory_limit = "256M"
        else:
            cpu_quota = 25   # High load, be conservative
            memory_limit = "128M"

        print(f"System CPU usage: {cpu_usage:.1f}%")
        process_job(idx, job, cpu_quota, memory_limit)
        time.sleep(1)

if __name__ == '__main__':
    main()

Advanced Techniques

Delegating cgroup Control to Users

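# The per-user systemd instance (user@1000.service) already owns a delegated
# subtree, so no manual mkdir/chown under /sys/fs/cgroup is needed.
# Check which controllers are delegated to it:
cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.controllers

# If cpu or io is missing, delegate it with a drop-in, then log out and back in:
sudo systemctl edit user@.service
# [Service]
# Delegate=cpu cpuset io memory pids

# The user can now run resource-limited units without root
systemd-run --user --unit=my-app \
    --property=CPUQuota=50% \
    ./my-program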

Freezing and Thawing Processes

# Freeze all processes in a cgroup (pause execution)
systemctl kill -s STOP nginx.service

# Resume
systemctl kill -s CONT nginx.service

# Using the cgroup freezer (needs root; use tee because "sudo echo ... >" redirects as the calling user)
echo 1 | sudo tee /sys/fs/cgroup/system.slice/nginx.service/cgroup.freeze
# Resume
echo 0 | sudo tee /sys/fs/cgroup/system.slice/nginx.service/cgroup.freeze

Memory Pressure Monitoring

# Watch memory pressure
cat /sys/fs/cgroup/system.slice/nginx.service/memory.pressure

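# Output format (PSI - Pressure Stall Information):
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

# "some" = share of time at least one task was stalled waiting for memory
# "full" = share of time all non-idle tasks were stalled at once
# avg10/avg60/avg300 are percentages over 10, 60 and 300 seconds; total is in microseconds
# Higher values = more memory pressure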

# Monitor continuously
watch -n 1 cat /sys/fs/cgroup/system.slice/nginx.service/memory.pressure
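
For alerting or logging, the PSI text is easier to work with once parsed. The Python sketch below turns a *.pressure file into nested dictionaries; parse_pressure is a hypothetical helper and the sample values are made up for illustration.

```python
def parse_pressure(text):
    """Parse PSI output into {"some": {...}, "full": {...}}.

    Averages become floats (percent of time stalled); total stays an int.
    """
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        values = dict(field.split("=", 1) for field in fields)
        result[kind] = {
            key: (int(value) if key == "total" else float(value))
            for key, value in values.items()
        }
    return result


if __name__ == "__main__":
    sample = (
        "some avg10=1.50 avg60=0.30 avg300=0.10 total=12345\n"
        "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
    )
    psi = parse_pressure(sample)
    print(psi["some"]["avg10"], psi["full"]["total"])
```

The same parser works for cpu.pressure and io.pressure, since all three files share this layout.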

CPU and I/O Pressure Monitoring

# CPU pressure (process wait times)
cat /sys/fs/cgroup/system.slice/nginx.service/cpu.pressure

# Monitor I/O pressure
cat /sys/fs/cgroup/system.slice/nginx.service/io.pressure

Troubleshooting

Controllers Not Available

# Check which controllers are available
cat /sys/fs/cgroup/cgroup.controllers

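# If controllers are missing, check the kernel config
# (if /proc/config.gz does not exist, load it first with: sudo modprobe configs)
zcat /proc/config.gz | grep -i cgroup
# Should show CONFIG_CGROUPS=y, CONFIG_MEMCG=y, etc.

# Enable controllers for the subtree (needs root, so pipe through sudo tee)
echo "+cpu +memory +io" | sudo tee /sys/fs/cgroup/cgroup.subtree_control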

Permission Denied Errors

# Can't write to cgroup files?

# Check ownership
ls -l /sys/fs/cgroup/system.slice/my-service.service/

# systemd manages cgroups - use systemctl
# Don't manually write to cgroup files under system.slice

# For user services, use --user flag
systemd-run --user --unit=my-app ./program

Resource Limits Not Applied

# Check if accounting is enabled
systemctl show my-service | grep Accounting

# Enable accounting
sudo systemctl set-property my-service.service \
    CPUAccounting=yes \
    MemoryAccounting=yes \
    IOAccounting=yes

# Or in service file:
# [Service]
# CPUAccounting=yes
# MemoryAccounting=yes
# IOAccounting=yes

OOM Kills Despite Memory Limit

# Check memory.events for OOM kills
cat /sys/fs/cgroup/system.slice/my-service.service/memory.events
# Look for "oom_kill" counter

# Check kernel messages
dmesg | grep -i "killed process"
journalctl -k | grep -i oom

# Possible causes:
# 1. Limit too low
# 2. Memory leak
# 3. Swap disabled (MemorySwapMax=0)

# Increase limit or add swap
sudo systemctl set-property my-service.service MemoryMax=1G
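
The counters in memory.events tell you which limit the service actually hit. The Python sketch below parses the file and returns a one-line summary; the function names and the summary wording are assumptions made for this example.

```python
def parse_memory_events(text):
    """Parse memory.events ("name count" per line) into a dict of ints."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def summarize(events):
    """Return the most severe event seen, checked from worst to mildest."""
    if events.get("oom_kill", 0) > 0:
        return "processes were OOM-killed"
    if events.get("max", 0) > 0:
        return "memory.max was reached"
    if events.get("high", 0) > 0:
        return "throttled at memory.high"
    return "no limit events"


if __name__ == "__main__":
    sample = "low 0\nhigh 12\nmax 3\noom 1\noom_kill 1\n"
    print(summarize(parse_memory_events(sample)))
```

If only the high counter grows, the service is being throttled rather than killed, and raising MemoryHigh may be enough.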

Performance Best Practices for Raspberry Pi

Optimal Resource Allocation

# Raspberry Pi 4 (4GB RAM example)

# Reserve resources for system
# System services: ~500MB RAM, 20% CPU

# Critical services (database, web server)
# - High CPUWeight (200-500)
# - Guaranteed MemoryMin
# - High IOWeight (200-500)

# Background services (backups, monitoring)
# - Low CPUWeight (10-50)
# - Best-effort memory
# - Low IOWeight (10-50)

# Example allocation:
# postgresql: CPUWeight=300, MemoryMin=512M, IOWeight=400
# nginx: CPUWeight=200, MemoryMin=256M, IOWeight=300
# backup: CPUWeight=20, MemoryMax=256M, IOWeight=10
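
To see what the example weights mean in practice, the Python sketch below computes each service's share of CPU time when all three are busy at once. This assumes the services are siblings in the same slice and all competing for CPU; when a service is idle, the others can use its share. cpu_shares is a hypothetical helper.

```python
def cpu_shares(weights):
    """Return each unit's fraction of CPU time under full contention."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    shares = cpu_shares({"postgresql": 300, "nginx": 200, "backup": 20})
    for name, share in shares.items():
        print(f"{name}: {share * 100:.1f}%")
```

With these weights the backup job gets roughly 4% of CPU time under full load, which is why low weights suit background work better than hard quotas: the job still runs at full speed when nothing else needs the CPU.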

Avoiding SD Card Wear

# Limit write-heavy services
sudo systemctl set-property logging-service.service \
    IOWriteBandwidthMax="/dev/mmcblk0 1M"

# Use tmpfs for temporary files
# PrivateTmp=yes in service files

# Batch writes
# Configure applications to flush less frequently

Temperature-Based Throttling

#!/bin/bash
# temp-aware-limits.sh - Adjust CPU limits based on temperature

SERVICE="intensive-service"

while true; do
    TEMP=$(vcgencmd measure_temp | grep -o '[0-9.]*')

    # --runtime keeps each change in /run, so the loop does not write to the SD card
    if (( $(echo "$TEMP > 70" | bc -l) )); then
        # High temp - reduce CPU
        sudo systemctl set-property --runtime "$SERVICE" CPUQuota=30%
        echo "Temperature $TEMP°C - CPU limited to 30%"
    elif (( $(echo "$TEMP > 60" | bc -l) )); then
        # Warm - moderate CPU
        sudo systemctl set-property --runtime "$SERVICE" CPUQuota=60%
        echo "Temperature $TEMP°C - CPU limited to 60%"
    else
        # Cool - allow one full core
        sudo systemctl set-property --runtime "$SERVICE" CPUQuota=100%
        echo "Temperature $TEMP°C - CPU limited to 100% (one full core)"
    fi

    sleep 30
done
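
The threshold logic of this script can be kept separate from the system calls, which makes it easy to test off the Pi. The Python sketch below mirrors the same thresholds; parse_temp and quota_for_temp are hypothetical helper names, and the input format is the temp=NN.N'C string that vcgencmd measure_temp prints.

```python
import re


def parse_temp(output):
    """Extract the temperature in Celsius from vcgencmd output, or None."""
    match = re.search(r"temp=([0-9.]+)", output)
    return float(match.group(1)) if match else None


def quota_for_temp(temp_c):
    """Map a temperature to a CPUQuota percentage using the script's thresholds."""
    if temp_c > 70:
        return 30
    if temp_c > 60:
        return 60
    return 100


if __name__ == "__main__":
    temp = parse_temp("temp=65.2'C")
    print(temp, quota_for_temp(temp))
```

A caller would pass the result to systemctl set-property only when the quota changes, which avoids reapplying the same value every 30 seconds.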

Comparing with Containers

cgroups vs Docker/Podman

┌────────────────────────────────────────────────────────┐
│                    Docker Container                     │
│                                                         │
│  Includes:                                             │
│  ✓ cgroups (resource limits)                           │
│  ✓ Namespaces (isolation)                              │
│  ✓ Union filesystem (layered images)                   │
│  ✓ Network virtualization                              │
│  ✓ Image registry & distribution                       │
│                                                         │
│  Overhead: Higher (full isolation)                     │
│  Complexity: Higher (images, networking)               │
│  Portability: Excellent (images)                       │
└────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────┐
│              systemd-run + cgroups v2                   │
│                                                         │
│  Includes:                                             │
│  ✓ cgroups (resource limits)                           │
│  ✓ Basic sandboxing (filesystem, network)             │
│  ✗ No image system                                     │
│  ✗ No network virtualization                           │
│  ✗ No distribution mechanism                           │
│                                                         │
│  Overhead: Lower (lighter isolation)                   │
│  Complexity: Lower (native tools)                      │
│  Portability: Limited (host-specific)                  │
└────────────────────────────────────────────────────────┘

When to use systemd-run + cgroups:
✓ Resource limiting for native services
✓ Learning container fundamentals
✓ Lightweight isolation
✓ System service management
✓ Temporary/one-off tasks

When to use Docker/Podman:
✓ Application distribution
✓ Complex multi-service apps
✓ Microservices architecture
✓ Reproducible environments
✓ CI/CD pipelines

Understanding Container Internals

# Docker container is essentially:

# 1. Process in a cgroup
docker run -d --cpus=0.5 --memory=512m nginx
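# The resource limits alone are roughly equivalent to
# (nginx must stay in the foreground to run as a simple service):
systemd-run --unit=nginx-container \
    --property=CPUQuota=50% \
    --property=MemoryMax=512M \
    nginx -g 'daemon off;'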

# 2. Plus namespaces (PID, network, mount, etc.)
# 3. Plus overlay filesystem (union mount)
# 4. Plus network bridge/veth pairs

# Inspect Docker's cgroup
docker ps --no-trunc
# Get container ID, then:
find /sys/fs/cgroup -name "*<container-id>*"
# Shows Docker's cgroup configuration

Summary

This comprehensive guide covered cgroups v2 and systemd-run for resource control on Raspberry Pi:

✅ Core Concepts

  • cgroups v2 unified hierarchy architecture
  • Controllers: CPU, memory, I/O, PIDs
  • systemd integration and native cgroup management
  • Differences from cgroups v1

✅ Resource Control

  • CPU: Quotas, weights, affinity, and relative sharing
  • Memory: Hard limits, soft limits, OOM policies, swap control
  • I/O: Bandwidth limiting, priority weighting, latency targets
  • PIDs: Process/thread count limiting

✅ systemd-run Usage

  • Ad-hoc service execution with resource limits
  • Temporary isolated environments
  • Dynamic resource allocation
  • One-off tasks with sandboxing

✅ Sandboxing

  • Filesystem restrictions (ProtectSystem, ProtectHome, PrivateTmp, ReadWritePaths)
  • Network isolation (PrivateNetwork)
  • Capability dropping and granting
  • System call filtering
  • Complete sandbox examples

✅ Monitoring

  • systemd-cgtop for real-time resource monitoring
  • Resource accounting and statistics
  • Pressure Stall Information (PSI) for CPU/memory/I/O
  • cgroup file inspection

✅ Practical Applications

  • Build systems with resource constraints
  • Media encoding pipelines
  • Database services with guaranteed resources
  • Batch processing with adaptive scaling
  • Development environment isolation

✅ Advanced Techniques

  • User cgroup delegation
  • Process freezing/thawing
  • Memory and CPU pressure monitoring
  • Temperature-aware throttling for Raspberry Pi

✅ Production Best Practices

  • Optimal resource allocation strategies
  • SD card wear reduction
  • Performance tuning for Raspberry Pi
  • OOM handling and recovery

✅ Container Understanding

  • cgroups as container foundation
  • Comparison with Docker/Podman
  • When to use native cgroups vs containers
  • Docker cgroup inspection

Next Steps

Advanced Topics:

  1. Namespaces: Complete process isolation (PID, network, mount, user)
  2. seccomp: System call filtering for enhanced security
  3. AppArmor/SELinux: Mandatory access control integration
  4. cgroup BPF Programs: Custom resource control logic
  5. Kubernetes: Multi-node cgroup orchestration

With cgroups v2 and systemd-run, you have powerful, kernel-level resource control for any workload - from simple CPU limiting to complex multi-service orchestration. This is the same technology powering Docker, Kubernetes, and modern cloud infrastructure!