cgroups v2 & systemd-run - Resource Control and Sandboxing for Raspberry Pi
Introduction
Control Groups (cgroups) v2 is a unified Linux kernel feature that provides hierarchical resource management and isolation for processes. Combined with systemd-run, it enables powerful process control, resource limiting, and sandboxing without requiring full containerization.
Together, cgroups v2 and systemd-run give you:
- Resource Limiting: Set precise CPU, memory, and I/O limits per process
- Priority Control: Ensure critical services get resources they need
- Resource Isolation: Prevent runaway processes from affecting the system
- Accounting: Track exact resource consumption per service
- Dynamic Control: Adjust limits on running processes without restart
- Sandboxing: Restrict process capabilities and filesystem access
Why cgroups v2 matters for Raspberry Pi:
- Limited Resources: Make the most of limited RAM and CPU
- Multiple Services: Run many services without interference
- Development: Test resource-constrained scenarios
- Education: Understand container technology foundations
- Stability: Prevent memory/CPU exhaustion crashes
- Fair Sharing: Ensure responsive system under load
Common use cases:
- Build Jobs: Limit compilation processes to prevent system freezes
- Media Encoding: Control CPU/memory for ffmpeg/handbrake
- Database Services: Guarantee minimum resources for critical services
- Development Environments: Isolate test environments
- Backup Operations: Limit I/O impact during backups
- Web Services: Prevent DoS from resource exhaustion
- Batch Processing: Run background jobs with limited resources
- Container Understanding: Learn Docker/Podman internals
This comprehensive guide covers:
- cgroups v2 Fundamentals: Unified hierarchy and controllers
- systemd Integration: Native cgroup management with systemd
- CPU Control: Shares, quotas, and affinity management
- Memory Limits: Hard limits, soft limits, and OOM behavior
- I/O Control: Bandwidth limiting and priority scheduling
- systemd-run: Ad-hoc service execution with resource limits
- Sandboxing: Filesystem, network, and capability restrictions
- Monitoring: Real-time resource usage tracking
- Production Examples: Database, web server, build systems
- Performance Tuning: Optimal configurations for Raspberry Pi
Perfect for:
- System Administrators: Multi-tenant system management
- DevOps Engineers: Understanding container resource control
- Developers: Resource-constrained testing environments
- Students: Learning modern Linux resource management
- Home Lab Enthusiasts: Running multiple services efficiently
- Embedded Engineers: Deterministic resource allocation
- Performance Engineers: Controlling resource contention
Understanding cgroups v2
cgroups v1 vs v2
| cgroups v1 (Legacy - Multiple Hierarchies)
├── cpu/ (CPU controller)
│ ├── system.slice
│ └── user.slice
├── memory/ (Memory controller)
│ ├── system.slice
│ └── user.slice
├── blkio/ (Block I/O controller)
│ └── ...
└── ... (12+ separate hierarchies)
Problems:
- Controllers in different hierarchies
- Inconsistent behavior
- Complex management
- Resource conflicts
────────────────────────────────────────────────────
cgroups v2 (Unified Hierarchy)
/sys/fs/cgroup/
├── system.slice/
│ ├── cpu.max (All controllers)
│ ├── memory.max (in same)
│ ├── io.max (hierarchy)
│ ├── ssh.service/
│ └── nginx.service/
├── user.slice/
│ └── user-1000.slice/
└── machine.slice/ (Virtual machines/containers)
Benefits:
✓ Single unified hierarchy
✓ Consistent controller interface
✓ Better resource coordination
✓ Simplified management
|
cgroups v2 Architecture
| ┌─────────────────────────────────────────────────────────┐
  │                     Root cgroup (/)                     │
  │                                                         │
  │  Controllers: cpu, memory, io, pids                     │
  │  All system processes start here                        │
  └───────────────────────────┬─────────────────────────────┘
                              │
          ┌───────────────────┼───────────────────┐
          │                   │                   │
          ▼                   ▼                   ▼
   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
   │ system.slice │    │ user.slice   │    │ machine.slice│
   │              │    │              │    │              │
   │ System       │    │ User         │    │ VMs/         │
   │ services     │    │ sessions     │    │ Containers   │
   └──────┬───────┘    └──────┬───────┘    └──────────────┘
          │                   │
          │                   ├─ user-1000.slice
          │                   │  └─ session-1.scope
          │                   │     ├─ app-1
          │                   │     └─ app-2
          │
          ├─ nginx.service
          ├─ postgresql.service
          └─ myapp.service
             ├─ cpu.max: 50000 100000 (50% CPU)
             ├─ memory.max: 512M
             ├─ io.weight: 100
             └─ pids.max: 100
|
Available Controllers
| Controller | Purpose | Key Files |
| cpu | CPU time distribution | cpu.max, cpu.weight |
| memory | RAM usage limits | memory.max, memory.high |
| io | Block device I/O | io.max, io.weight |
| pids | Process/thread limits | pids.max |
| cpuset | CPU/memory node binding | cpuset.cpus, cpuset.mems |
| rdma | RDMA/IB resources | (Advanced networking) |
| hugetlb | Huge page allocation | (Advanced memory) |
Checking cgroups v2 Support
Verify cgroups v2 is Enabled
| # Check if cgroups v2 is mounted
mount | grep cgroup
# Should show: cgroup2 on /sys/fs/cgroup type cgroup2
# Check cgroup version
stat -fc %T /sys/fs/cgroup/
# Output: cgroup2fs (v2) or tmpfs (v1)
# List available controllers
cat /sys/fs/cgroup/cgroup.controllers
# Example: cpuset cpu io memory hugetlb pids rdma
# Check the systemd version
systemctl --version | head -n 1
# Raspberry Pi OS uses systemd with cgroups v2
# Check whether the unified hierarchy was forced on the kernel command line
grep -o "systemd.unified_cgroup_hierarchy=[0-9]" /proc/cmdline
# Shows systemd.unified_cgroup_hierarchy=1 only if it was set explicitly.
# No output is normal when systemd already defaults to cgroups v2.
|
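The mount check above can also be scripted. The following is a minimal sketch rather than an official tool: the helper name `cgroup_mode` and the sample mount line are assumptions made for illustration, and the function only classifies text in the `/proc/mounts` format by filesystem type.

```python
def cgroup_mode(mounts_text):
    """Classify the cgroup setup from text in /proc/mounts format."""
    fstypes = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, filesystem type, options, ...
        if len(fields) >= 3 and fields[2] in ("cgroup", "cgroup2"):
            fstypes.add(fields[2])
    if fstypes == {"cgroup2"}:
        return "unified (v2)"
    if fstypes == {"cgroup"}:
        return "legacy (v1)"
    if fstypes == {"cgroup", "cgroup2"}:
        return "hybrid (v1 + v2)"
    return "no cgroup filesystem mounted"


# Sample line for illustration; on a real system pass open("/proc/mounts").read()
sample = "cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0\n"
print(cgroup_mode(sample))
```

A result of "hybrid" or "legacy" means the steps in the next section are needed before the v2-only features in this guide will work.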
Enable cgroups v2 (if needed)
Most recent Raspberry Pi OS versions use cgroups v2 by default. If not:
| # Edit boot configuration
# (on releases before Bookworm the file is /boot/cmdline.txt)
sudo nano /boot/firmware/cmdline.txt
# Append to the existing single line (do not create a new line):
systemd.unified_cgroup_hierarchy=1 cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1
# Reboot
sudo reboot
# Verify after reboot
stat -fc %T /sys/fs/cgroup/
|
CPU Control
Understanding CPU Limits
| # CPU controller parameters:
# cpu.weight (1-10000, default 100)
# - Relative CPU share (like nice values)
# - Higher weight = more CPU time when contending
# cpu.max (quota period)
# - Absolute CPU limit
# - Format: "max_microseconds period_microseconds"
# - Example: "50000 100000" = 50% of one CPU core
# cpu.stat
# - Statistics: usage_usec, user_usec, system_usec
|
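To make the `cpu.max` arithmetic concrete, here is a small sketch that converts between the "quota period" pair and a percentage of one core. The function names `parse_cpu_max` and `format_cpu_max` are made up for this example, and the 100000 microsecond period is simply the value used in the comments above.

```python
def parse_cpu_max(text):
    """Return the limit as a percentage of one core, or None when unlimited."""
    quota, period = text.split()
    if quota == "max":
        return None
    return 100.0 * int(quota) / int(period)


def format_cpu_max(percent, period_usec=100_000):
    """Build a cpu.max value that grants `percent` of one core per period."""
    quota_usec = int(period_usec * percent / 100)
    return f"{quota_usec} {period_usec}"


print(parse_cpu_max("50000 100000"))  # half of one core
print(format_cpu_max(150))            # the pair corresponding to 1.5 cores
print(parse_cpu_max("max 100000"))    # no quota configured
```

Values above 100% are valid on multi-core systems: a quota of 150000 in a 100000 microsecond period allows one and a half cores' worth of CPU time.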
CPU Limiting with systemd-run
| # Run a process with 50% CPU limit
systemd-run --unit=my-task \
--property=CPUQuota=50% \
stress --cpu 1
# Check status
systemctl status my-task
# View CPU usage
systemd-cgtop
# Stop
systemctl stop my-task
# Multiple CPUs - limit to 1.5 cores
systemd-run --unit=build-job \
--property=CPUQuota=150% \
make -j4
# CPU affinity - pin to specific cores
systemd-run --unit=pinned-task \
--property=AllowedCPUs=0,1 \
./my-program
# Relative CPU weight (priority)
systemd-run --unit=low-priority \
--property=CPUWeight=50 \
./background-task
systemd-run --unit=high-priority \
--property=CPUWeight=200 \
./important-task
|
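Weights only matter under contention, and the resulting split is proportional. As a rough sketch (the helper `expected_cpu_shares` is hypothetical and assumes the units are sibling cgroups that are all CPU-bound at the same time), the two weights used above work out as follows:

```python
def expected_cpu_shares(weights):
    """Approximate CPU split between competing sibling cgroups, in percent."""
    total = sum(weights.values())
    return {name: 100.0 * weight / total for name, weight in weights.items()}


shares = expected_cpu_shares({"low-priority": 50, "high-priority": 200})
for unit, share in shares.items():
    print(f"{unit}: {share:.0f}% of the contended CPU time")
```

With weights 50 and 200 the split is 20% to 80%. When only one of the units is busy, it can still use all available CPU, because weights do not cap usage the way CPUQuota does.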
Persistent CPU Limits for Services
| # Create service with CPU limits
sudo systemctl edit --force --full my-service.service
|
| [Unit]
Description=CPU-Limited Service
[Service]
Type=simple
ExecStart=/usr/local/bin/my-app
# CPU Limits
# (comments must be on their own line; systemd does not support trailing comments)
# Max 75% of one core
CPUQuota=75%
# Default relative weight
CPUWeight=100
# Pin to cores 0 and 1
CPUAffinity=0 1
# Accounting
CPUAccounting=yes
[Install]
WantedBy=multi-user.target
|
| # Enable and start
sudo systemctl daemon-reload
sudo systemctl enable --now my-service
# Monitor CPU usage
systemd-cgtop /system.slice/my-service.service
# View detailed stats
systemctl show my-service | grep CPU
|
Memory Control
Memory Limit Types
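| # memory.max
# - Hard limit: the kernel tries to reclaim first, then OOM-kills inside the cgroup if usage cannot be brought below it
# - Format: bytes in the cgroup file; systemd's MemoryMax= also accepts suffixes (512M, 2G)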
# memory.high
# - Soft limit: Throttling if exceeded
# - System tries to reclaim memory
# - More graceful than memory.max
# memory.low
# - Best-effort protection
# - Memory not reclaimed unless necessary
# memory.min
# - Hard protection
# - Memory never reclaimed
# memory.swap.max
# - Swap space limit
# - Default: unlimited
|
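The cgroup files themselves hold plain byte counts, so it helps to see what a suffixed value expands to. The sketch below is an illustration only: `size_to_bytes` is a hypothetical helper, and it assumes the base-1024 interpretation of K, M, G and T that systemd documents for its memory settings.

```python
# Multipliers for size suffixes, assuming base 1024 as systemd uses for memory
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def size_to_bytes(value):
    """Convert '512M' or '2G' to bytes; return None for 'max' or 'infinity'."""
    value = value.strip()
    if value in ("max", "infinity"):
        return None
    suffix = value[-1].upper()
    if suffix in _UNITS:
        return int(float(value[:-1]) * _UNITS[suffix])
    return int(value)


print(size_to_bytes("512M"))  # bytes that end up in memory.max
print(size_to_bytes("2G"))
print(size_to_bytes("max"))   # no limit
```

This is useful when comparing a unit's `MemoryMax=512M` with the raw number shown in `memory.max`.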
Memory Limiting with systemd-run
| # Hard memory limit (512MB)
systemd-run --unit=limited-app \
--property=MemoryMax=512M \
python3 memory-hungry-app.py
# Soft limit (throttling and reclaim above 400MB, hard limit at 512MB)
systemd-run --unit=throttled-app \
--property=MemoryHigh=400M \
--property=MemoryMax=512M \
./app
# Memory protection (guarantee 256MB minimum)
systemd-run --unit=protected-db \
--property=MemoryMin=256M \
--property=MemoryMax=1G \
postgres
# Disable swap for this process
systemd-run --unit=no-swap-app \
--property=MemorySwapMax=0 \
./realtime-app
# Monitor memory usage
systemd-cgtop /system.slice/limited-app.service
|
Handling OOM (Out Of Memory)
| # Create service that handles OOM gracefully
sudo systemctl edit --force --full oom-test.service
|
| [Unit]
Description=OOM Test Service
[Service]
Type=simple
ExecStart=/usr/bin/stress --vm 1 --vm-bytes 1G
# Memory limits
MemoryMax=512M
MemoryAccounting=yes
# OOM policy (options: continue, stop, kill)
OOMPolicy=stop
# Make the service more likely to be killed (default 0, range -1000 to 1000)
OOMScoreAdjust=100
# Restart on OOM
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
|
| # Start and watch
sudo systemctl start oom-test
journalctl -u oom-test -f
# Check OOM kills
dmesg | grep -i oom
|
I/O Control
I/O Bandwidth Limiting
| # Find device major:minor numbers
lsblk
# Example output:
# NAME MAJ:MIN
# mmcblk0 179:0
# ├─mmcblk0p1 179:1
# └─mmcblk0p2 179:2
# Limit read bandwidth to 10 MB/s
systemd-run --unit=io-limited \
--property=IOReadBandwidthMax="/dev/mmcblk0 10M" \
dd if=/dev/mmcblk0 of=/dev/null bs=1M
# Limit write bandwidth to 5 MB/s
systemd-run --unit=backup-job \
--property=IOWriteBandwidthMax="/dev/mmcblk0 5M" \
rsync -av /source/ /backup/
# Combined read/write limits
systemd-run --unit=controlled-io \
--property=IOReadBandwidthMax="/dev/mmcblk0 20M" \
--property=IOWriteBandwidthMax="/dev/mmcblk0 10M" \
./disk-intensive-app
# I/O weight (relative priority, 1-10000)
systemd-run --unit=background-backup \
--property=IOWeight=50 \
rsync -av /data/ /backup/
systemd-run --unit=important-db \
--property=IOWeight=500 \
postgres
|
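Under the hood, bandwidth limits end up as one line per device in the cgroup's `io.max` file, keyed by the major:minor number that `lsblk` reports. The sketch below only formats such a line; `io_max_line` is a hypothetical helper, the byte values are passed in explicitly, and the example assumes that "10M" means 10,000,000 bytes per second (systemd documents base 1000 for bandwidth suffixes).

```python
def io_max_line(major, minor, rbps=None, wbps=None):
    """Format an io.max entry; None means no limit ('max') for that direction."""
    read_limit = "max" if rbps is None else str(rbps)
    write_limit = "max" if wbps is None else str(wbps)
    return f"{major}:{minor} rbps={read_limit} wbps={write_limit}"


# mmcblk0 is 179:0 in the lsblk example above
print(io_max_line(179, 0, rbps=10_000_000, wbps=5_000_000))
print(io_max_line(179, 0, wbps=5_000_000))
```

In practice systemd writes this file for you; the sketch is only meant to help when reading `io.max` to confirm that a limit was applied.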
I/O Latency Control
| # Set I/O latency target (a time span such as 10ms)
systemd-run --unit=low-latency \
--property=IODeviceLatencyTargetSec="/dev/mmcblk0 10ms" \
./latency-sensitive-app
# Monitor I/O stats
systemd-cgtop -d 1
# Shows: IO Read, IO Write per cgroup
|
Persistent I/O Limits
| # /etc/systemd/system/backup.service
[Unit]
Description=Nightly Backup Service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh
# I/O Limits (reduce impact on system)
IOAccounting=yes
# Low priority
IOWeight=10
# Limit reads
IOReadBandwidthMax=/dev/mmcblk0 5M
# Limit writes
IOWriteBandwidthMax=/dev/mmcblk0 10M
[Install]
WantedBy=multi-user.target
|
Process Limits
PID Limits
| # Limit number of processes/threads
systemd-run --unit=fork-bomb-protection \
--property=TasksMax=50 \
./my-app
# Unlimited tasks
systemd-run --unit=many-threads \
--property=TasksMax=infinity \
./multithreaded-app
# Check current task count (use the unit name given to systemd-run)
systemctl status fork-bomb-protection | grep Tasks
# Tasks: 23 (limit: 50)
|
Sandboxing with systemd
Filesystem Sandboxing
| # Read-only root filesystem
systemd-run --unit=readonly-fs \
--property=ProtectSystem=strict \
--property=ReadWritePaths=/var/lib/myapp \
./my-app
# Isolate /tmp
systemd-run --unit=private-tmp \
--property=PrivateTmp=yes \
./app
# Hide /home directories
systemd-run --unit=no-home \
--property=ProtectHome=yes \
./service
# Restrict path visibility
systemd-run --unit=restricted \
--property=InaccessiblePaths=/root \
--property=InaccessiblePaths=/boot \
./app
|
Network Sandboxing
| # Disable network access
systemd-run --unit=offline-app \
--property=PrivateNetwork=yes \
./compute-task
# Isolate network namespace (private loopback)
systemd-run --unit=isolated-net \
--property=PrivateNetwork=yes \
python3 -m http.server 8080
# Accessible only within this network namespace
|
Capability Restrictions
| # Drop all capabilities
systemd-run --unit=no-caps \
--property=CapabilityBoundingSet= \
./app
# Grant specific capabilities
systemd-run --unit=bind-low-port \
--property=AmbientCapabilities=CAP_NET_BIND_SERVICE \
--property=User=nobody \
./webserver --port 80
# Common capabilities:
# CAP_NET_BIND_SERVICE - Bind to ports < 1024
# CAP_NET_RAW - Raw sockets (ping)
# CAP_SYS_ADMIN - Mount, etc. (dangerous!)
|
Complete Sandbox Example
| # /etc/systemd/system/sandboxed-web.service
[Unit]
Description=Sandboxed Web Application
[Service]
Type=simple
ExecStart=/usr/local/bin/webapp
User=webapp
Group=webapp
# Resource Limits
CPUQuota=50%
MemoryMax=512M
TasksMax=100
IOWeight=100
# Filesystem Sandboxing
# (comments must be on their own line; systemd does not support trailing comments)
# Entire filesystem read-only except /dev, /proc and /sys
ProtectSystem=strict
# No access to /home, /root and /run/user
ProtectHome=yes
# Private /tmp and /var/tmp
PrivateTmp=yes
# Only writable location
ReadWritePaths=/var/lib/webapp
# Can't gain privileges
NoNewPrivileges=yes
# Network Sandboxing
# Block all by default
IPAddressDeny=any
# Allow localhost
IPAddressAllow=localhost
# Allow local network
IPAddressAllow=192.168.1.0/24
# Kernel/System Protection
# /proc/sys, /sys read-only
ProtectKernelTunables=yes
# Can't load kernel modules
ProtectKernelModules=yes
# cgroup hierarchy read-only
ProtectControlGroups=yes
# No access to kernel logs
ProtectKernelLogs=yes
# Capabilities
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
# System Calls
# Allow the common syscalls needed by services
SystemCallFilter=@system-service
# Deny privileged syscalls
SystemCallFilter=~@privileged
# Return permission denied for filtered calls
SystemCallErrorNumber=EPERM
# Misc
# No realtime scheduling
RestrictRealtime=yes
# Can't create namespaces
RestrictNamespaces=yes
# Prevent execution domain changes
LockPersonality=yes
[Install]
WantedBy=multi-user.target
|
Monitoring and Inspection
systemd-cgtop - Real-Time Resource Monitor
| # Interactive resource monitor
systemd-cgtop
# Output example:
# Control Group Tasks %CPU Memory Input/s Output/s
# / 234 12.5 1.2G - -
# /system.slice 89 8.2 512.0M 1.2M/s 500K/s
# /system.slice/nginx.service 12 2.1 64.0M - 100K/s
# /system.slice/postgres.service 25 5.8 256.0M 800K/s 200K/s
# /user.slice 98 4.1 512.0M - -
# Update every second
systemd-cgtop -d 1
# Show specific cgroup
systemd-cgtop /system.slice/nginx.service
# Sort by memory
systemd-cgtop --order=memory
# Sort by CPU
systemd-cgtop --order=cpu
|
Inspecting cgroup Settings
| # Show all properties of a service
systemctl show nginx.service
# Filter for resource properties
systemctl show nginx.service | grep -E "(CPU|Memory|IO|Tasks)"
# View cgroup files directly
ls /sys/fs/cgroup/system.slice/nginx.service/
cat /sys/fs/cgroup/system.slice/nginx.service/memory.current
cat /sys/fs/cgroup/system.slice/nginx.service/cpu.stat
# View process cgroup membership
cat /proc/PID/cgroup
# Example: 0::/system.slice/nginx.service
|
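The `/proc/PID/cgroup` line maps directly onto a directory under `/sys/fs/cgroup`. As a small sketch (the helper name `cgroup_dir` is an assumption for this example, and it only handles the cgroups v2 entry, which has hierarchy ID 0 and an empty controller list):

```python
from pathlib import PurePosixPath


def cgroup_dir(proc_cgroup_text, mount="/sys/fs/cgroup"):
    """Return the cgroup v2 directory for the text of /proc/<pid>/cgroup."""
    for line in proc_cgroup_text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        # The v2 entry always looks like "0::<path>"
        if hierarchy_id == "0" and controllers == "":
            return str(PurePosixPath(mount) / path.lstrip("/"))
    return None


print(cgroup_dir("0::/system.slice/nginx.service\n"))
```

The returned directory is where files such as `memory.current` and `cpu.stat` for that process live.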
Resource Accounting Statistics
| # CPU usage statistics
systemctl show nginx.service | grep CPUUsage
# CPUUsageNSec=1234567890000 (nanoseconds)
# Memory usage
systemctl show nginx.service | grep Memory
# MemoryCurrent=67108864 (bytes)
# MemoryPeak=134217728
# I/O statistics
systemctl show nginx.service | grep IO
# IOReadBytes=1048576000
# IOWriteBytes=524288000
# Task count
systemctl show nginx.service | grep Tasks
# TasksCurrent=12
# TasksMax=100
|
Practical Examples
Build System with Resource Limits
| #!/bin/bash
# build-with-limits.sh - Compile project with resource constraints
PROJECT_DIR="/home/pi/myproject"
BUILD_DIR="$PROJECT_DIR/build"
echo "Starting build with resource limits..."
systemd-run \
    --unit=project-build-$(date +%s) \
    --collect \
    --wait \
    --working-directory="$PROJECT_DIR" \
    --property=CPUQuota=75% \
    --property=MemoryMax=1G \
    --property=IOWeight=50 \
    --property=Nice=10 \
    bash -c "
        mkdir -p build
        cd build
        cmake ..
        make -j2
    "
BUILD_STATUS=$?
if [ $BUILD_STATUS -eq 0 ]; then
    echo "Build completed successfully!"
else
    echo "Build failed with status $BUILD_STATUS"
fi
exit $BUILD_STATUS
|
Media Encoding with Resource Limits
| #!/bin/bash
# encode-videos.sh - Batch video encoding with resource limits
INPUT_DIR="/home/pi/videos/input"
OUTPUT_DIR="/home/pi/videos/output"
mkdir -p "$OUTPUT_DIR"
# Process each video file
find "$INPUT_DIR" -type f -name "*.mp4" | while read -r VIDEO; do
    BASENAME=$(basename "$VIDEO" .mp4)
    OUTPUT_FILE="$OUTPUT_DIR/${BASENAME}_compressed.mp4"
    # Unit names may not contain spaces or most punctuation, so escape the file name
    UNIT_NAME="encode-$(systemd-escape "$BASENAME")"
    echo "Encoding: $BASENAME"
    # Run ffmpeg with limits (prevent system freeze)
    systemd-run \
        --unit="$UNIT_NAME" \
        --collect \
        --wait \
        --property=CPUQuota=60% \
        --property=MemoryMax=512M \
        --property=IOWeight=30 \
        --property=Nice=15 \
        ffmpeg -i "$VIDEO" \
        -c:v libx264 \
        -crf 28 \
        -preset medium \
        -c:a aac \
        -b:a 128k \
        "$OUTPUT_FILE"
    if [ $? -eq 0 ]; then
        echo "✓ Completed: $BASENAME"
    else
        echo "✗ Failed: $BASENAME"
    fi
done
echo "All encoding jobs completed"
|
Database Service with Guaranteed Resources
| # /etc/systemd/system/postgresql-custom.service
[Unit]
Description=PostgreSQL with Resource Guarantees
After=network.target
[Service]
Type=notify
User=postgres
Group=postgres
ExecStart=/usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file=/etc/postgresql/13/main/postgresql.conf
# CPU
CPUAccounting=yes
# High priority (default 100)
CPUWeight=200
# Up to 1.5 cores
CPUQuota=150%
# Memory
MemoryAccounting=yes
# Guaranteed minimum
MemoryMin=512M
# Soft limit (throttle)
MemoryHigh=1G
# Hard limit (OOM)
MemoryMax=2G
# No swap
MemorySwapMax=0
# I/O
IOAccounting=yes
# High I/O priority
IOWeight=500
IOReadBandwidthMax=/dev/mmcblk0 50M
IOWriteBandwidthMax=/dev/mmcblk0 30M
# Processes
TasksMax=200
# Restart policy
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
|
Containerized Development Environment
| #!/bin/bash
# dev-env.sh - Isolated development environment
DEV_ROOT="/tmp/dev-env-$$"
mkdir -p "$DEV_ROOT"/{bin,lib,tmp,workspace}
# Copy essential binaries
cp /bin/bash "$DEV_ROOT/bin/"
cp /bin/ls "$DEV_ROOT/bin/"
cp /usr/bin/python3 "$DEV_ROOT/bin/"
# Copy shared libraries (every absolute path that ldd reports)
for BIN in "$DEV_ROOT/bin/"*; do
    ldd "$BIN" 2>/dev/null | grep -o '/[^ ]*' | while read -r LIB; do
        mkdir -p "$DEV_ROOT$(dirname "$LIB")"
        cp "$LIB" "$DEV_ROOT$LIB" 2>/dev/null
    done
done
echo "Starting isolated development environment..."
# The working directory and the bind-mount destination are resolved
# inside the new root, so both are written as /workspace
systemd-run \
    --unit="dev-env-$$" \
    --pty \
    --working-directory=/workspace \
    --collect \
    --property=CPUQuota=100% \
    --property=MemoryMax=512M \
    --property=PrivateNetwork=yes \
    --property=PrivateTmp=yes \
    --property=ProtectHome=yes \
    --property=RootDirectory="$DEV_ROOT" \
    --property=BindReadOnlyPaths="/home/pi/projects:/workspace" \
    /bin/bash
# Cleanup on exit
rm -rf "$DEV_ROOT"
|
Batch Processing with Dynamic Scaling
| #!/usr/bin/env python3
# batch-processor.py - Process jobs with adaptive resource limits
import subprocess
import sys
import time
def get_cpu_usage():
    """Get current system CPU usage percentage"""
    with open('/proc/stat', 'r') as f:
        line = f.readline()
    fields = [int(x) for x in line.split()[1:]]
    idle = fields[3]
    total = sum(fields)
    time.sleep(0.1)
    with open('/proc/stat', 'r') as f:
        line = f.readline()
    fields = [int(x) for x in line.split()[1:]]
    idle2 = fields[3]
    total2 = sum(fields)
    idle_delta = idle2 - idle
    total_delta = total2 - total
    usage = 100.0 * (1.0 - idle_delta / total_delta)
    return usage
def process_job(job_id, job_command, cpu_quota, memory_limit):
    """Process a single job with resource limits"""
    print(f"Starting job {job_id} (CPU: {cpu_quota}%, Memory: {memory_limit})")
    cmd = [
        'systemd-run',
        f'--unit=batch-job-{job_id}',
        '--collect',
        '--wait',
        f'--property=CPUQuota={cpu_quota}%',
        f'--property=MemoryMax={memory_limit}',
        '--property=IOWeight=30',
        'bash', '-c', job_command
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        print(f"✓ Job {job_id} completed")
    else:
        print(f"✗ Job {job_id} failed: {result.stderr}")
    return result.returncode
def main():
    jobs = [
        "sleep 5 && echo 'Job 1 done'",
        "stress --cpu 1 --timeout 10",
        "dd if=/dev/zero of=/tmp/test bs=1M count=100",
        "sleep 3 && echo 'Job 4 done'",
    ]
    for idx, job in enumerate(jobs, 1):
        # Adaptive resource allocation based on system load
        cpu_usage = get_cpu_usage()
        if cpu_usage < 30:
            cpu_quota = 100  # System idle, use full core
            memory_limit = "512M"
        elif cpu_usage < 70:
            cpu_quota = 50  # Moderate load
            memory_limit = "256M"
        else:
            cpu_quota = 25  # High load, be conservative
            memory_limit = "128M"
        print(f"System CPU usage: {cpu_usage:.1f}%")
        process_job(idx, job, cpu_quota, memory_limit)
        time.sleep(1)
if __name__ == '__main__':
    main()
|
Advanced Techniques
Delegating cgroup Control to Users
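| # systemd already delegates a cgroup subtree to each user's service manager,
# so do not create or chown directories under /sys/fs/cgroup by hand.
# Depending on the systemd version, not every controller is delegated by default.
# To let user units set CPU and I/O limits, extend the delegation with a drop-in:
sudo mkdir -p /etc/systemd/system/user@.service.d
printf '[Service]\nDelegate=cpu cpuset io memory pids\n' | \
sudo tee /etc/systemd/system/user@.service.d/delegate.conf
sudo systemctl daemon-reload
# Log out and back in, then run units under the user's own manager
systemd-run --user --unit=my-app \
--property=CPUQuota=50% \
./my-program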
|
Freezing and Thawing Processes
| # Pause every process in the unit by sending SIGSTOP
sudo systemctl kill -s STOP nginx.service
# Resume
sudo systemctl kill -s CONT nginx.service
# Using the cgroup v2 freezer through systemd (systemd 246 or newer)
sudo systemctl freeze nginx.service
# Resume
sudo systemctl thaw nginx.service
# The freezer state is visible as a cgroup file (1 = frozen, 0 = running)
cat /sys/fs/cgroup/system.slice/nginx.service/cgroup.freeze
|
Memory Pressure Monitoring
| # Watch memory pressure
cat /sys/fs/cgroup/system.slice/nginx.service/memory.pressure
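# Output format (PSI - Pressure Stall Information):
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# "some" = share of time at least one task was stalled waiting for memory
# "full" = share of time all non-idle tasks were stalled at once
# avg10/avg60/avg300 are percentages over the last 10, 60 and 300 seconds
# total is the accumulated stall time in microseconds
# Higher values = more memory pressure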
# Monitor continuously
watch -n 1 cat /sys/fs/cgroup/system.slice/nginx.service/memory.pressure
|
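For alerting or logging, the PSI text is easy to turn into numbers. The following is a minimal sketch: `parse_psi` is a hypothetical helper, the sample values are invented for illustration, and the same parser should work for `cpu.pressure` and `io.pressure` since they share the format shown above.

```python
def parse_psi(text):
    """Parse PSI text into {'some': {...}, 'full': {...}} with float values."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        result[kind] = {
            key: float(value) for key, value in (pair.split("=") for pair in pairs)
        }
    return result


# Invented sample; on a real system read the memory.pressure file instead
sample = (
    "some avg10=1.50 avg60=0.40 avg300=0.10 total=123456\n"
    "full avg10=0.25 avg60=0.05 avg300=0.00 total=4567\n"
)
psi = parse_psi(sample)
if psi["some"]["avg10"] > 1.0:
    print("memory pressure rising:", psi["some"]["avg10"])
```

The threshold of 1.0 here is arbitrary; a sensible value depends on the workload.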
CPU Pressure Monitoring
| # CPU pressure (process wait times)
cat /sys/fs/cgroup/system.slice/nginx.service/cpu.pressure
# Monitor I/O pressure
cat /sys/fs/cgroup/system.slice/nginx.service/io.pressure
|
Troubleshooting
Controllers Not Available
| # Check which controllers are available
cat /sys/fs/cgroup/cgroup.controllers
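# If controllers missing, check kernel config
# (if /proc/config.gz does not exist, load it first: sudo modprobe configs)
zcat /proc/config.gz | grep -i cgroup
# Should show CONFIG_CGROUPS=y, CONFIG_MEMCG=y, etc.
# Enable controllers for subtree (needs root; systemd normally manages this file)
echo "+cpu +memory +io" | sudo tee /sys/fs/cgroup/cgroup.subtree_control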
|
Permission Denied Errors
| # Can't write to cgroup files?
# Check ownership
ls -l /sys/fs/cgroup/system.slice/my-service.service/
# systemd manages cgroups - use systemctl
# Don't manually write to cgroup files under system.slice
# For user services, use --user flag
systemd-run --user --unit=my-app ./program
|
Resource Limits Not Applied
| # Check if accounting is enabled
systemctl show my-service | grep Accounting
# Enable accounting
sudo systemctl set-property my-service.service \
CPUAccounting=yes \
MemoryAccounting=yes \
IOAccounting=yes
# Or in service file:
# [Service]
# CPUAccounting=yes
# MemoryAccounting=yes
# IOAccounting=yes
|
OOM Kills Despite Memory Limit
| # Check memory.events for OOM kills
cat /sys/fs/cgroup/system.slice/my-service.service/memory.events
# Look for "oom_kill" counter
# Check kernel messages
dmesg | grep -i "killed process"
journalctl -k | grep -i oom
# Possible causes:
# 1. Limit too low
# 2. Memory leak
# 3. Swap disabled (MemorySwapMax=0)
# Increase limit or add swap
sudo systemctl set-property my-service.service MemoryMax=1G
|
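The `memory.events` file is a list of "key value" lines, so checking the `oom_kill` counter from a script takes only a few lines. This is a sketch under assumptions: `parse_flat_keyed` is a hypothetical helper and the sample counters are invented.

```python
def parse_flat_keyed(text):
    """Parse 'key value' lines (as in memory.events) into a dict of ints."""
    return {
        key: int(value)
        for key, value in (line.split() for line in text.splitlines() if line.strip())
    }


# Invented sample; on a real system read the unit's memory.events file
sample = "low 0\nhigh 12\nmax 3\noom 1\noom_kill 1\n"
events = parse_flat_keyed(sample)
if events.get("oom_kill", 0) > 0:
    print(f"{events['oom_kill']} process(es) were OOM-killed in this cgroup")
if events.get("high", 0) > 0:
    print(f"memory.high was exceeded {events['high']} time(s)")
```

A rising `high` counter with no `oom_kill` suggests the soft limit is doing its job; a rising `oom_kill` counter means the hard limit is too tight for the workload.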
Optimal Resource Allocation
| # Raspberry Pi 4 (4GB RAM example)
# Reserve resources for system
# System services: ~500MB RAM, 20% CPU
# Critical services (database, web server)
# - High CPUWeight (200-500)
# - Guaranteed MemoryMin
# - High IOWeight (200-500)
# Background services (backups, monitoring)
# - Low CPUWeight (10-50)
# - Best-effort memory
# - Low IOWeight (10-50)
# Example allocation:
# postgresql: CPUWeight=300, MemoryMin=512M, IOWeight=400
# nginx: CPUWeight=200, MemoryMin=256M, IOWeight=300
# backup: CPUWeight=20, MemoryMax=256M, IOWeight=10
|
Avoiding SD Card Wear
| # Limit write-heavy services
sudo systemctl set-property logging-service.service \
IOWriteBandwidthMax="/dev/mmcblk0 1M"
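# Use tmpfs for temporary files
# TemporaryFileSystem=<path> in a service file mounts a RAM-backed tmpfs there
# (PrivateTmp=yes only isolates /tmp; it is RAM-backed only if the host's /tmp is a tmpfs)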
# Batch writes
# Configure applications to flush less frequently
|
Temperature-Based Throttling
| #!/bin/bash
# temp-aware-limits.sh - Adjust CPU limits based on temperature
SERVICE="intensive-service"
while true; do
    TEMP=$(vcgencmd measure_temp | grep -o '[0-9.]*')
    # --runtime keeps each change in /run, so the loop does not write to the SD card
    if (( $(echo "$TEMP > 70" | bc -l) )); then
        # High temp - reduce CPU
        sudo systemctl set-property --runtime "$SERVICE" CPUQuota=30%
        echo "Temperature $TEMP°C - CPU limited to 30%"
    elif (( $(echo "$TEMP > 60" | bc -l) )); then
        # Warm - moderate CPU
        sudo systemctl set-property --runtime "$SERVICE" CPUQuota=60%
        echo "Temperature $TEMP°C - CPU limited to 60%"
    else
        # Cool - allow one full core
        sudo systemctl set-property --runtime "$SERVICE" CPUQuota=100%
        echo "Temperature $TEMP°C - CPU limited to 100% (one full core)"
    fi
    sleep 30
done
|
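The decision logic in the script can be separated from the commands that apply it, which makes the thresholds easy to test. The sketch below mirrors the 60 and 70 degree thresholds used above; `parse_temp` and `quota_for` are hypothetical names, and the sample string assumes the usual `temp=48.3'C` output format of `vcgencmd measure_temp`.

```python
import re


def parse_temp(output):
    """Extract the temperature in degrees Celsius from vcgencmd-style output."""
    match = re.search(r"temp=([0-9.]+)", output)
    if match is None:
        raise ValueError(f"unexpected temperature output: {output!r}")
    return float(match.group(1))


def quota_for(temp_c):
    """Map a temperature to a CPUQuota percentage (same thresholds as the script)."""
    if temp_c > 70:
        return 30
    if temp_c > 60:
        return 60
    return 100


temp = parse_temp("temp=48.3'C")
print(f"{temp} C -> CPUQuota={quota_for(temp)}%")
```

Keeping the mapping in one function also makes it simple to add more steps or hysteresis later without touching the code that calls systemctl.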
Comparing with Containers
cgroups vs Docker/Podman
| ┌────────────────────────────────────────────────────────┐
  │                    Docker Container                    │
  │                                                        │
  │  Includes:                                             │
  │  ✓ cgroups (resource limits)                           │
  │  ✓ Namespaces (isolation)                              │
  │  ✓ Union filesystem (layered images)                   │
  │  ✓ Network virtualization                              │
  │  ✓ Image registry & distribution                       │
  │                                                        │
  │  Overhead: Higher (full isolation)                     │
  │  Complexity: Higher (images, networking)               │
  │  Portability: Excellent (images)                       │
  └────────────────────────────────────────────────────────┘
  ┌────────────────────────────────────────────────────────┐
  │                systemd-run + cgroups v2                │
  │                                                        │
  │  Includes:                                             │
  │  ✓ cgroups (resource limits)                           │
  │  ✓ Basic sandboxing (filesystem, network)              │
  │  ✗ No image system                                     │
  │  ✗ No network virtualization                           │
  │  ✗ No distribution mechanism                           │
  │                                                        │
  │  Overhead: Lower (lighter isolation)                   │
  │  Complexity: Lower (native tools)                      │
  │  Portability: Limited (host-specific)                  │
  └────────────────────────────────────────────────────────┘
When to use systemd-run + cgroups:
✓ Resource limiting for native services
✓ Learning container fundamentals
✓ Lightweight isolation
✓ System service management
✓ Temporary/one-off tasks
When to use Docker/Podman:
✓ Application distribution
✓ Complex multi-service apps
✓ Microservices architecture
✓ Reproducible environments
✓ CI/CD pipelines
|
Understanding Container Internals
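| # A Docker container is essentially:
# 1. Process in a cgroup
docker run -d --cpus=0.5 --memory=512m nginx
# The resource-limit part is roughly equivalent to
# (nginx is kept in the foreground so systemd can track it):
systemd-run --unit=nginx-container \
--property=CPUQuota=50% \
--property=MemoryMax=512M \
nginx -g 'daemon off;'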
# 2. Plus namespaces (PID, network, mount, etc.)
# 3. Plus overlay filesystem (union mount)
# 4. Plus network bridge/veth pairs
# Inspect Docker's cgroup
docker ps --no-trunc
# Get container ID, then:
find /sys/fs/cgroup -name "*<container-id>*"
# Shows Docker's cgroup configuration
|
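The mapping between the two sets of flags is mechanical for the resource limits. As a rough sketch (the helper `docker_limits_to_systemd` is hypothetical, covers only `--cpus` and `--memory`, and assumes the memory value uses a k, m or g suffix):

```python
def docker_limits_to_systemd(cpus, memory):
    """Translate docker --cpus/--memory values into systemd-run properties."""
    quota_percent = round(cpus * 100)  # 0.5 CPUs -> CPUQuota=50%
    return [
        f"--property=CPUQuota={quota_percent}%",
        f"--property=MemoryMax={memory.upper()}",  # 512m -> 512M
    ]


args = docker_limits_to_systemd(cpus=0.5, memory="512m")
print(" ".join(["systemd-run", *args, "nginx"]))
```

Namespaces, the overlay filesystem and networking have no such one-line equivalent, which is the part that container runtimes add on top of cgroups.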
Summary
This comprehensive guide covered cgroups v2 and systemd-run for resource control on Raspberry Pi:
✅ Core Concepts
- cgroups v2 unified hierarchy architecture
- Controllers: CPU, memory, I/O, PIDs
- systemd integration and native cgroup management
- Differences from cgroups v1
✅ Resource Control
- CPU: Quotas, weights, affinity, and relative sharing
- Memory: Hard limits, soft limits, OOM policies, swap control
- I/O: Bandwidth limiting, priority weighting, latency targets
- PIDs: Process/thread count limiting
✅ systemd-run Usage
- Ad-hoc service execution with resource limits
- Temporary isolated environments
- Dynamic resource allocation
- One-off tasks with sandboxing
✅ Sandboxing
- Filesystem restrictions (ProtectSystem, ProtectHome, PrivateTmp, ReadWritePaths)
- Network isolation (PrivateNetwork)
- Capability dropping and granting
- System call filtering
- Complete sandbox examples
✅ Monitoring
- systemd-cgtop for real-time resource monitoring
- Resource accounting and statistics
- Pressure Stall Information (PSI) for CPU/memory/I/O
- cgroup file inspection
✅ Practical Applications
- Build systems with resource constraints
- Media encoding pipelines
- Database services with guaranteed resources
- Batch processing with adaptive scaling
- Development environment isolation
✅ Advanced Techniques
- User cgroup delegation
- Process freezing/thawing
- Memory and CPU pressure monitoring
- Temperature-aware throttling for Raspberry Pi
✅ Production Best Practices
- Optimal resource allocation strategies
- SD card wear reduction
- Performance tuning for Raspberry Pi
- OOM handling and recovery
✅ Container Understanding
- cgroups as container foundation
- Comparison with Docker/Podman
- When to use native cgroups vs containers
- Docker cgroup inspection
Next Steps
Advanced Topics:
- Namespaces: Complete process isolation (PID, network, mount, user)
- seccomp: System call filtering for enhanced security
- AppArmor/SELinux: Mandatory access control integration
- cgroup BPF Programs: Custom resource control logic
- Kubernetes: Multi-node cgroup orchestration
With cgroups v2 and systemd-run, you have powerful, kernel-level resource control for any workload - from simple CPU limiting to complex multi-service orchestration. This is the same technology powering Docker, Kubernetes, and modern cloud infrastructure!