Linux cgroups and namespaces

Содержание

difference between cgroups and namespaces
What even is a container: namespaces and cgroups
namespaces
entering the namespace of another program
cgroups: resource limits
seccomp-bpf
what are containers?
learn more in “How Containers Work”

difference between cgroups and namespaces

The proper links for those two notions have been fixed in PR 14307:

Under the hood, Docker is built on the following components:

The cgroups and namespaces capabilities of the Linux kernel

cgroup: Control Groups provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.
namespace: wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource.

Cgroups = limits how much you can use;
namespaces = limits what you can see (and therefore use)

Cgroups involve resource metering and limiting:

Namespaces provide processes with their own view of the system

pid
net
mnt
uts
ipc
user: userns it is graduating from experimental in docker 1.10
(per-daemon-instance remapping of container root to an unprivileged user is in progress: PR 12648: see its design)

No, chroot is not based on a namespace: see itnext.io/…. More than three years later though, keep in mind the new Docker (19.03, still in beta) can be run as rootless: github.com/moby/moby/blob/…. And can expose resources on the host namespace: github.com/moby/moby/pull/38913

cgroups limits the resources which a process or set of processes can use these resources could be CPU,Memory,Network I/O or access to filesystem while namespace restrict the visibility of group of processes to the rest of the system.

Cgroups(control groups) does resource management.
It determines how much host machine resources to be given to containers.

for example:- we define resources in docker-compose yaml file for creating services like:

resources: limits: cpus: "0.1" (100 millicores) memory: 50M

Here, in this example we are explicitly asking cgroups to allocate these resources to particular container.

Namespaces: provides process isolation, complete isolation of containers, separate file system. There are 6 types of namespaces:
1. mount ns — for file system.
2. UTS(Unique time sharing) ns- which checks for different hostnames of running containers
3. IPC ns — interprocess communication
4. Network ns- takes care of different ip allocation to different containers
5. PID ns — process id isolation
6. user ns- different username(uid)

Источник

What even is a container: namespaces and cgroups

The first time I heard about containers it was like – what? what’s that?

Is a container a process? What’s Docker? Are containers Docker? Help!

The word “container” doesn’t mean anything super precise. Basically there are a few new Linux kernel features (“namespaces” and “cgroups”) that let you isolate processes from each other. When you use those features, you call it “containers”.

Basically these features let you pretend you have something like a virtual machine, except it’s not a virtual machine at all, it’s just processes running in the same Linux kernel. Let’s dive in!

namespaces

Okay, so let’s say we wanted to have something like a virtual machine. One feature you might want is – my processes should be separated from the other processes on the computer, right?

One feature Linux provides here is namespaces. There are a bunch of different kinds:

in a pid namespace you become PID 1 and then your children are other processes. All the other programs are gone
in a networking namespace you can run programs on any port you want without it conflicting with what’s already running
in a mount namespace you can mount and unmount filesystems without it affecting the host filesystem. So you can have a totally different set of devices mounted (usually less)

It turns out that making namespaces is totally easy! You can just run a program called unshare (named after the system call of the same name)

Let’s make a new PID namespace and run bash in it!

$ sudo unshare --fork --pid --mount-proc bash

root@kiwi:~# ps aux USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 1 0.0 0.0 28372 4148 pts/6 S 23:01 0:00 bash root 2 0.0 0.0 44432 3836 pts/6 R+ 23:01 0:00 ps aux

Wow, we’re in a whole new world! There are only 2 processes running – bash and ps. Cool, that was easy!

It’s worth noting that if I look from my regular PID namespaces, I can see the processes in the new PID namespace:

root 14121 0.0 0.0 33264 4044 pts/6 S+ 23:09 0:00 htop

This process 14121 (regular namespace) is process 3 in my new PID namespace. So they’re two views on the same thing, but one is a lot more restricted.

entering the namespace of another program

Also you can enter the namespace of another running program! You do this with a command called nsenter . I think this is how docker exec works? Maybe?

cgroups: resource limits

Okay, so we’ve made a new magical world with new processes and sockets that is separate from our old world. Cool!

What if I want to limit how much memory or CPU one of my programs is using? WE’RE IN LUCK. In 2007 some people developed cgroups just for us. These are like when you nice a process but with a bunch more features.

Let’s make a cgroup! We’ll make one that just limits memory

$ sudo cgcreate -a bork -g memory:mycoolgrou

$ ls -l /sys/fs/cgroup/memory/mycoolgroup/ -rw-r--r-- 1 bork root 0 Okt 10 23:16 memory.kmem.limit_in_bytes -rw-r--r-- 1 bork root 0 Okt 10 23:14 memory.kmem.max_usage_in_bytes

ooh, max usage in bytes! Okay, let’s try that! 10 megabytes should be enough for anyone!

$ sudo echo 10000000 > /sys/fs/cgroup/memory/mycoolgroup/memory.limit_in_bytes

Awesome, let’s try using my cgroup!

$ sudo cgexec -g memory:mycoolgroup bash

I ran a bunch of commands. they worked fine. Then I tried compiling a Rust program 🙂 🙂 🙂

$ root@kiwi:~/work/ruby-stacktrace# cargo build error: Could not execute process `rustc -vV` (never executed) Caused by: Cannot allocate memory (os error 12)

Fantastic! We have successfully limited our program’s memory!

seccomp-bpf

Okay, one last feature! If you’re isolating your processes, you might in addition to restricting their memory and CPU usage, want to restrict what system calls they can run! Like, “no network access for you!”. That might help with security! We like security.

This brings us to seccomp-bpf, a Linux kernel feature that lets you filter which system calls your process can run.

what are containers?

Okay, now that you’ve seen these two features you might think “wow, yeah, I could build a bunch of scripts around all these features and have something really cool!” It would be really lightweight and my processes would be isolated from each other, and, wow!

Some people thought that in the past too! They built a thing called “Docker containers” that uses these features :). That’s all Docker is! Of course Docker has a lot of features these days, but a lot of it is built on these basic Linux kernel primitives.

learn more in “How Containers Work”

In 2020, I published a whole zine about containers called “How containers work”. You can get it here for $12.

Here’s the cover and table of contents:

You might also like the Recurse Center, my very favorite programming community (my posts about it)

Источник