Fools ignore complexity. Pragmatists suffer it. Some can avoid it.
Geniuses remove it. -- Perlis's Programming Proverb #58 (1982)
», __file__)
In general, virtualization refers to the abstraction of computer
resources. This chapter is primarily concerned with <em> server
virtualization</em>, a concept which makes it possible to run
more than one operating system simultaneously and independently
of each other on a single physical computer. We first describe
the different virtualization frameworks but quickly specialize on
Linux OS-level virtualization and its virtual machines, called <em>
containers</em>. Container platforms for Linux are built on top of
<em>namespaces</em> and <em>control groups</em>, the low-level kernel
features which implement abstraction and isolation of processes. We
look at both concepts in some detail. The final section discusses
<em>micoforia</em>, a minimal container platform.

»)
SECTION(«Virtualization Frameworks»)

The origins of server virtualization date back to the 1960s. The
first virtual machine was created as a collaboration between IBM
(International Business Machines) and MIT (Massachusetts Institute
of Technology). Since then, many different approaches have been
designed, resulting in several <em> virtualization frameworks</em>. All
frameworks promise to improve resource utilization and availability, to
reduce costs, and to provide greater flexibility. While some of these
benefits might be real, they do not come for free. The costs include
a single point of failure (the host), decreased performance, added
complexity, and increased maintenance effort because the VMs have to
be debugged, documented, and maintained. This section briefly describes
the three main virtualization frameworks. We list the advantages and
disadvantages of each and give some examples.
SUBSECTION(«Software Virtualization (Emulation)»)

This virtualization framework does not play a significant role in
server virtualization; it is only included for completeness. Emulation
means to imitate a complete hardware architecture in software,
including peripheral devices. All CPU instructions and hardware
interrupts are interpreted by the emulator rather than being run by
native hardware. Since this approach has a large performance penalty,
it is only suitable when speed is not critical. For this reason,
emulation is typically employed for ancient hardware like arcade
game systems and home computers such as the Commodore 64. Despite
the performance penalty, emulation is valuable because it allows
applications and operating systems to run on the current platform as
they did in their original environment.

Examples: Bochs, MAME, VICE.
SUBSECTION(«Paravirtualization and Hardware-Assisted Virtualization»)

These virtualization frameworks are characterized by the presence
of a <em> hypervisor</em>, also known as <em> Virtual Machine
Monitor</em>, which translates system calls from the VMs to native
hardware requests. In contrast to Software Virtualization, the
host OS does not emulate hardware resources but offers a special
API to the VMs. If the presented interface is different from that
of the underlying hardware, the term <em> paravirtualization </em>
is used. The guest OS then has to be modified to include
paravirtualized drivers. In 2005 AMD and Intel added hardware
virtualization instructions to the CPUs and IOMMUs (Input/Output memory
management units) to the chipsets. This allowed VMs to directly execute
privileged instructions and use peripheral devices. This so-called <em>
Hardware-Assisted Virtualization </em> allows unmodified operating
systems to run on the VMs.

The main advantage of Hardware-Assisted Virtualization is its
flexibility, as the host OS does not need to match the OS running on
the VMs. The disadvantages are hardware compatibility constraints and
performance loss. Although these days all hardware has virtualization
support, there are still significant differences in performance between
the host and the VM. Moreover, peripheral devices like storage hardware
have to be compatible with the chipset to make use of the IOMMU.

Examples: KVM (with QEMU as hypervisor), Xen, UML.
SUBSECTION(«OS-level Virtualization (Containers)»)

OS-level Virtualization is a technique for lightweight virtualization.
The abstractions are built directly into the kernel and no
hypervisor is needed. In this context the term "virtual machine" is
inaccurate, which is why OS-level VMs go by different names. On Linux,
they are called <em> containers</em>; other
operating systems call them <em> jails </em> or <em> zones</em>. We
shall exclusively use "container" from now on. All containers share
a single kernel, so the OS running in the container has to match the
host OS. However, each container has its own root file system, so
containers can differ in user space. For example, different containers
can run different Linux distributions. Since programs running in a
container use the normal system call interface to communicate with
the kernel, OS-level Virtualization does not require hardware support
for efficient performance. In fact, OS-level Virtualization imposes
next to no overhead.

OS-level Virtualization is superior to the alternatives because of its
simplicity and its performance. The only disadvantage is the lack of
flexibility. It is simply not an option if some of the VMs must run
a different operating system than the host.

Examples: LXC, Micoforia, Singularity, Docker.
<ul>

<li> On any Linux system, check if the processor supports virtualization
by running <code> cat /proc/cpuinfo</code>. Hint: svm and vmx. </li>

<li> Hypervisors come in two flavors called <em> native </em> and <em>
hosted</em>. Explain the difference and the pros and cons of either
flavor. Is QEMU a native or a hosted hypervisor? </li>

<li> Find the AMD Programmer's Manual online. The chapter on
"Secure Virtual Machine" describes the CPU instructions related to
Hardware-Assisted Virtualization. Glance over this chapter to get an
idea of the complexity of Hardware-Assisted Virtualization. </li>

</ul>
<ul>
<li> Recall the concept of <em> direct memory access </em> (DMA)
and explain why DMA is a problem for virtualization. Which of the
three virtualization frameworks of this chapter are affected by this
problem? </li>

<li> Compare AMD's <em> Rapid Virtualization Indexing </em> to Intel's
<em> Extended Page Tables</em>. </li>

<li> Suppose a hacker gained root access to a VM and wishes to proceed
from there to also gain full control over the host OS. Discuss the threat
model in the context of the three virtualization frameworks covered
in this section. </li>

</ul>
»)
SECTION(«Namespaces»)

Namespaces partition the set of processes into disjoint subsets
with local scope. Where the traditional Unix systems provided only
a single system-wide resource shared by all processes, the namespace
abstractions make it possible to give processes the illusion of living
in their own isolated instance. Linux implements the following
six different types of namespaces: mount (Linux-2.4.x, 2002), IPC
(Linux-2.6.19, 2006), UTS (Linux-2.6.19, 2006), PID (Linux-2.6.24,
2008), network (Linux-2.6.29, 2009), user (Linux-3.8, 2013).
For OS-level virtualization all six namespace types are typically
employed to make the containers look like independent systems.

Before we look at each namespace type, we briefly describe how
namespaces are created and how information related to namespaces can
be obtained for a process.
SUBSECTION(«Namespace API»)

<p> Initially, there is only a single namespace of each type called the
<em> root namespace</em>. All processes belong to this namespace. The
<code> clone(2) </code> system call is a generalization of the classic
<code> fork(2) </code> which allows privileged users to create new
namespaces by passing one or more of the six <code> CLONE_NEW* </code>
flags. The child process is made a member of the new namespace. Calling
plain <code> fork(2) </code> or <code> clone(2) </code> with no
<code> CLONE_NEW* </code> flag lets the newly created process inherit the
namespaces from its parent. There are two additional system calls,
<code> setns(2) </code> and <code> unshare(2)</code>, which both
change the namespace(s) of the calling process without creating a
new process. For the latter, there is a user command, also called
<code> unshare(1)</code>, which makes the namespace API available to
scripts. </p>

<p> The <code> /proc/$PID </code> directory of each process contains a
<code> ns </code> subdirectory which contains one file per namespace
type. The inode number of this file is the <em> namespace ID</em>.
Hence, by running <code> stat(1) </code> on these files one can tell
whether two different processes belong to the same namespace. Normally a
namespace ceases to exist when the last process in the namespace
terminates. However, by opening <code> /proc/$PID/ns/$TYPE </code>
and keeping the file descriptor open one can prevent the namespace
from disappearing. </p>
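<p> The namespace ID lookup described above can be sketched in a few
lines of Python. This is only an illustration: the function names are
made up for this example, and the sketch assumes a Linux system with
<code> /proc </code> mounted. </p>

```python
import os
import re


def parse_ns_link(target):
    """Split a link target such as 'uts:[4026531838]' into (type, ID)."""
    match = re.fullmatch(r"([a-z_]+):\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"not a namespace link target: {target!r}")
    return match.group(1), int(match.group(2))


def namespace_id(pid, ns_type):
    """Return the namespace ID: the inode number of /proc/PID/ns/TYPE."""
    return os.stat(f"/proc/{pid}/ns/{ns_type}").st_ino


def same_namespace(pid_a, pid_b, ns_type):
    """Compare the namespace IDs of two processes for one namespace type."""
    return namespace_id(pid_a, ns_type) == namespace_id(pid_b, ns_type)
```

<p> For example, <code> same_namespace(os.getpid(), 1, "uts") </code>
tells whether the caller shares the UTS namespace of the init process,
provided the caller is permitted to inspect that process. </p>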
SUBSECTION(«UTS Namespaces»)

UTS is short for <em> UNIX Time-sharing System</em>. The old-fashioned
word "time-sharing" has been replaced by <em> multitasking</em>
but the old name lives on in the <code> uname(2) </code> system
call which fills out the fields of a <code> struct utsname</code>.
On return the <code> nodename </code> field of this structure
contains the hostname which was set by a previous call to <code>
sethostname(2)</code>. Similarly, the <code> domainname </code> field
contains the string that was set with <code> setdomainname(2)</code>.

UTS namespaces provide isolation of these two system identifiers. That
is, processes in different UTS namespaces might see different host and
domain names. Changing the host or domain name affects only processes
which belong to the same UTS namespace as the process which called
<code> sethostname(2) </code> or <code> setdomainname(2)</code>.
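The isolation can be observed with the following Python sketch, which
is not taken from the course material. It forks a child, lets the child
call <code> unshare(2) </code> through <code> ctypes </code> and changes
the hostname there. The function name is hypothetical, the flag values
are copied from the kernel headers, and the sketch assumes that the
kernel permits the caller to create the namespaces. If it does not,
nothing is changed.

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000   # flag values as defined in <linux/sched.h>
CLONE_NEWUSER = 0x10000000  # lets unprivileged users own the new UTS namespace


def hostname_in_new_uts_namespace(name):
    """Set the hostname inside a fresh UTS namespace of a child process.

    Returns the hostname seen by the child, or None if the namespaces
    could not be created (for example due to missing privileges).
    """
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        try:
            os.close(read_end)
            libc = ctypes.CDLL(None, use_errno=True)
            flags = CLONE_NEWUTS
            if os.geteuid() != 0:
                flags |= CLONE_NEWUSER
            if libc.unshare(flags) == 0:
                socket.sethostname(name)  # only affects the new namespace
                os.write(write_end, socket.gethostname().encode())
        finally:
            os._exit(0)  # never return into the parent's code path
    os.close(write_end)
    with os.fdopen(read_end, "rb") as pipe:
        seen = pipe.read().decode()
    os.waitpid(pid, 0)
    return seen or None
```

Whatever the child does, <code> socket.gethostname() </code> in the
parent returns the same value before and after the call because the
parent stays in its original UTS namespace.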
SUBSECTION(«Mount Namespaces»)

The <em> mount namespaces </em> are the oldest Linux namespace
type. This is natural since they are supposed to overcome
well-known limitations of the venerable <code> chroot(2) </code>
system call which was introduced in 1979. Mount namespaces isolate
the mount points seen by processes so that processes in different
mount namespaces can have different views of the file system hierarchy.

As with other namespace types, new mount namespaces are created by
calling <code> clone(2) </code> or <code> unshare(2)</code>. The
new mount namespace starts out with a copy of the caller's mount
point list. However, with more than one mount namespace the <code>
mount(2) </code> and <code> umount(2) </code> system calls no longer
operate on a global set of mount points. Whether or not a mount
or unmount operation has an effect on processes in different mount
namespaces than the caller's is determined by the configurable <em>
mount propagation </em> rules. By default, modifications to the list
of mount points only affect the processes which are in the same
mount namespace as the process which initiated the modification. This
setting is controlled by the <em> propagation type </em> of the
mount point. Besides the obvious private and shared types, there is
also the <code> MS_SLAVE </code> propagation type which lets mount
and unmount events propagate from a "master" to its "slaves"
but not the other way round.
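The propagation type of a mount point can be read off the optional
fields of <code> /proc/$PID/mountinfo</code>. The following Python
sketch classifies a single line of that file. The function name is made
up for this example, and the sketch assumes the field layout documented
in <code> proc(5)</code>, where the optional fields sit between the
mount options and a lone dash.

```python
def propagation_type(mountinfo_line):
    """Classify one line of /proc/PID/mountinfo by its propagation type."""
    fields = mountinfo_line.split()
    separator = fields.index("-")    # the optional fields end at the lone "-"
    optional = fields[6:separator]   # e.g. ["shared:1"] or ["master:2"]
    tags = {field.split(":")[0] for field in optional}
    if "unbindable" in tags:
        return "unbindable"
    if "shared" in tags and "master" in tags:
        return "shared and slave"
    if "shared" in tags:
        return "shared"
    if "master" in tags:
        return "slave"
    return "private"                 # no optional field at all
```

Applying the function to every line of <code> /proc/self/mountinfo
</code> gives a quick overview of how mount events would propagate from
the current mount namespace.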
SUBSECTION(«Network Namespaces»)

Network namespaces not only partition the set of processes, as all
six namespace types do, but also the set of network interfaces. That
is, each physical or virtual network interface belongs to one (and
only one) network namespace. Initially, all interfaces are in the
root network namespace. This can be changed with the command <code>
ip link set iface netns PID</code>. Processes only see interfaces
whose network namespace matches the one they belong to. This lets
processes in different network namespaces have different ideas about
which network devices exist. Each network namespace has its own IP
stack, IP routing table and TCP and UDP ports. This makes it possible
to start, for example, many <code> sshd(8) </code> processes which
all listen on "their own" TCP port 22.

An OS-level virtualization framework typically leaves physical
interfaces in the root network namespace but creates a dedicated
network namespace and a virtual interface pair for each container. One
end of the pair is left in the root namespace while the other end is
configured to belong to the dedicated namespace, which contains all
processes of the container.
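The interface setup a container platform performs can be sketched as a
short list of <code> ip(8) </code> invocations. The Python function
below only builds the argument vectors and runs nothing. Its name and
the two interface names are hypothetical, and the sketch assumes that
the iproute2 version of <code> ip </code> is installed.

```python
def veth_setup_commands(container_pid, host_if="veth-host", guest_if="veth-guest"):
    """Return the ip(8) command lines that connect a container to the host."""
    return [
        # Create the pair; both ends start out in the caller's namespace.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", guest_if],
        # Move one end into the network namespace of the given process.
        ["ip", "link", "set", guest_if, "netns", str(container_pid)],
        # The end which stays behind has to be activated on the host side.
        ["ip", "link", "set", host_if, "up"],
    ]
```

A privileged caller could pass each list to <code> subprocess.run()
</code>. Addresses and routes still have to be configured on both ends
afterwards.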
SUBSECTION(«PID Namespaces»)

This namespace type allows a process to have more than one process
ID. Unlike network interfaces which disappear when they enter a
different network namespace, a process is still visible in the root
namespace after it has entered a different PID namespace. Besides its
existing PID it gets a second PID which is only valid inside the target
namespace. Similarly, when a new PID namespace is created by passing
the <code> CLONE_NEWPID </code> flag to <code> clone(2)</code>, the
child process gets some unused PID in the original PID namespace
but PID 1 in the new namespace.

As a consequence, processes in different PID namespaces can have the
same PID. In particular, there can be arbitrarily many "init" processes,
which all have PID 1. The usual rules for PID 1 apply within each PID
namespace. That is, orphaned processes are reparented to the init
process, and it is a fatal error if the init process terminates,
causing all processes in the namespace to terminate as well. PID
namespaces can be nested, but under normal circumstances they are
not. So we won't discuss nesting.

Since each process in a non-root PID namespace also has a PID in the
root PID namespace, processes in the root PID namespace can "see" all
processes but not vice versa. Hence a process in the root namespace can
send signals to all processes while processes in the child namespace
can only send signals to processes in their own namespace.

Processes can be moved from the root PID namespace into a child
PID namespace but not the other way round. Moreover, a process can
instruct the kernel to create subsequent child processes in a different
PID namespace.
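On reasonably recent kernels the different PIDs of a process can be
inspected through the <code> NSpid </code> line of <code>
/proc/$PID/status</code>. The following Python sketch extracts them
from the contents of that file. The function name is made up, and the
presence of the <code> NSpid </code> field (Linux-4.1 or later, to the
best of our knowledge) is an assumption.

```python
def pids_by_namespace(status_text):
    """Return all PIDs of a process, outermost PID namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(pid) for pid in line.split()[1:]]
    raise ValueError("no NSpid line found (kernel too old?)")
```

For the init process of a container, reading the status file from the
root namespace would yield a two-element list whose last entry is 1.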
SUBSECTION(«User Namespaces»)

User namespaces were implemented rather late compared to other
namespace types. The implementation was completed in 2013. The purpose
of user namespaces is to isolate user and group IDs. Initially there
is only one user namespace, the <em> initial namespace </em> to which
all processes belong. As with all namespace types, a new user namespace
is created with <code> unshare(2) </code> or <code> clone(2)</code>.

The UID and GID of a process can be different in different
namespaces. In particular, an unprivileged process may have UID
0 inside a user namespace. When a process is created in a new
namespace or a process joins an existing user namespace, it gains full
privileges in this namespace. However, the process has no additional
privileges in the parent/previous namespace. Moreover, a certain flag
is set for the process which prevents the process from entering yet
another namespace with elevated privileges. In particular it does not
keep its privileges when it returns to its original namespace. User
namespaces can be nested, but we don't discuss nesting here.

Each user namespace has an <em> owner</em>, which is the effective user
ID (EUID) of the process which created the namespace. Any process
in the root user namespace whose EUID matches the owner ID has all
capabilities in the child namespace.

If <code> CLONE_NEWUSER </code> is specified together with other
<code> CLONE_NEW* </code> flags in a single <code> clone(2) </code>
or <code> unshare(2) </code> call, the user namespace is guaranteed
to be created first, giving the child/caller privileges over the
remaining namespaces created by the call.

It is possible to map UIDs and GIDs between namespaces. The <code>
/proc/$PID/uid_map </code> and <code> /proc/$PID/gid_map </code> files
are used to get and set the mappings. We will only talk about UID
mappings in the sequel because the mechanism for the GID mappings is
analogous. When the <code> /proc/$PID/uid_map </code> (pseudo-)file is
read, the contents are computed on the fly and depend on both the user
namespace to which process <code> $PID </code> belongs and the user
namespace of the calling process. Each line contains three numbers
which specify the mapping for a range of UIDs. The numbers have
to be interpreted in one of two ways, depending on whether the two
processes belong to the same user namespace or not. All system calls
which deal with UIDs transparently translate UIDs by consulting these
maps. A map for a newly created namespace is established by writing
UID-triples <em> once </em> to <em> one </em> <code> uid_map </code>
file. Subsequent writes will fail.
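The translation performed with these triples can be modelled in a few
lines of Python. The sketch below covers only the case where the file
is read from a different user namespace, so that each triple means
"first UID inside the namespace, first UID outside, length of the
range". The function names are made up for this example.

```python
def parse_id_map(text):
    """Parse uid_map or gid_map contents into (inside, outside, count) triples."""
    return [tuple(int(number) for number in line.split())
            for line in text.splitlines() if line.strip()]


def to_outside(id_map, inside_id):
    """Translate an ID valid inside the namespace. None means unmapped."""
    for inside, outside, count in id_map:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None


def to_inside(id_map, outside_id):
    """Translate an ID of the outer namespace. None means unmapped."""
    for inside, outside, count in id_map:
        if outside <= outside_id < outside + count:
            return inside + (outside_id - outside)
    return None
```

With the single triple <code> 0 100000 65536</code>, UID 0 inside the
namespace corresponds to UID 100000 outside, while UID 1000 of the
outer namespace has no counterpart inside.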
SUBSECTION(«IPC Namespaces»)

System V inter-process communication (IPC) subsumes three different
mechanisms which enable unrelated processes to communicate with each
other. These mechanisms, known as <em> message queues</em>, <em>
semaphores </em> and <em> shared memory</em>, predate Linux by at
least a decade. They are mandated by the POSIX standard, so every Unix
system has to implement the prescribed API. The common characteristic
of the System V IPC mechanisms is that their objects are addressed
by system-wide IPC <em> identifiers</em> rather than by pathnames.

IPC namespaces isolate these resources so that processes in different
IPC namespaces have different views of the existing IPC identifiers.
When a new IPC namespace is created, it starts out with all three
identifier sets empty. Newly created IPC objects are only visible
to processes which belong to the same IPC namespace as the process
which created the object.
<ul>

<li> Examine <code> /proc/$$/mounts</code>,
<code>/proc/$$/mountinfo</code>, and <code>/proc/$$/mountstats</code>.
</li>

<li> Recall the concept of a <em> bind mount</em>. Describe the
sequence of mount operations a container implementation would need
to perform in order to set up a container whose root file system
is mounted on, say, <code> /mnt </code> before the container is
started. </li>

<li> What should happen on the attempt to change a read-only mount
to be read-write from inside of a container? </li>
<li> Compile and run <code> <a
href="#uts_namespace_example">uts-ns.c</a></code>, a minimal C
program which illustrates how to create a new UTS namespace. Explain
each line of the source code. </li>

<li> Run <code> ls -l /proc/$$/ns </code> to see the namespaces of
the shell. Run <code> stat -L /proc/$$/ns/uts </code> and confirm
that the inode number coincides with the number shown in the target
of the link of the <code> ls </code> output. </li>

<li> Discuss why creating a namespace is a privileged operation. </li>

<li> What is the parent process ID of the init process? Examine the
fourth field of <code> /proc/1/stat </code> to confirm. </li>

<li> It is possible for a process in a PID namespace to have a parent
which is outside of this namespace. This is certainly the case for
the process with PID 1. Can this also happen for a different process?
</li>

<li> Examine the <code> <a
href="#pid_namespace_example">pid-ns.c</a></code> program. Will the
two numbers printed as <code> PID </code> and <code> child PID </code>
be the same? What will be the PPID number? Compile and run the program
to see if your guess was correct. </li>

<li> Create a veth device pair. Check that both ends of the pair are
visible with <code> ip link show</code>. Start a second shell in a
different network namespace and confirm by running the same command
that no network interfaces exist in this namespace. In the original
namespace, set the namespace of one end of the pair to the process ID
of the second shell and confirm that the interface "moved" from one
namespace to the other. Configure (different) IP addresses on both ends
of the pair and transfer data through the ethernet tunnel between the
two shell processes which reside in different network namespaces. </li>

<li> Loopback, bridge, ppp and wireless are <em> network namespace
local devices</em>, meaning that the namespace of such devices can
not be changed. Explain why. Run <code> ethtool -k iface </code>
to find out which devices are network namespace local. </li>

<li> In a user namespace where the <code> uid_map </code> file has
not been written, system calls like <code> setuid(2) </code> which
change process UIDs fail. Why? </li>

<li> What should happen if a set-user-ID program is executed inside
of a user namespace and the on-disk UID of the program is not a mapped
UID? </li>

<li> Is it possible for a UID to map to different user names even if
no user namespaces are in use? </li>

</ul>
The <code> shmctl(2) </code> system call performs operations on a System V
shared memory segment. It operates on a <code> shmid_ds </code> structure
which contains in the <code> shm_lpid </code> field the PID of the process
which last attached or detached the segment. Describe the implications this API
detail has on the interaction between IPC and PID namespaces.
»)
SECTION(«Control Groups»)

<em> Control groups </em> (cgroups) allow processes to be grouped
and organized hierarchically in a tree. Each control group contains
processes which can be monitored or controlled as a unit, for example
by limiting the resources they can occupy. Several <em> controllers
</em> exist (CPU, memory, I/O, etc.), some of which actually impose
control while others only provide identification and relay control
to separate mechanisms. Unfortunately, control groups are not easy to
understand because the controllers are implemented in an inconsistent
way and because of the rather chaotic relationship between them.

In 2014 it was decided to rework the cgroup subsystem of the Linux
kernel. To keep existing applications working, the original cgroup
implementation, now called <em> cgroup-v1</em>, was retained and a
second, incompatible, cgroup implementation was designed. Cgroup-v2
aims to address the shortcomings of the first version, including its
inefficiency, inconsistency and the lack of interoperability among
controllers. The cgroup-v2 API was made official in 2016. Version 1
continues to work even if both implementations are active.

Both cgroup implementations provide a pseudo file system that
must be mounted in order to define and configure cgroups. The two
pseudo file systems may be mounted at the same time (on different
mount points). For both cgroup versions, the standard <code> mkdir(2)
</code> system call creates a new cgroup. To add a process to a cgroup
one must write its PID to one of the files in the pseudo file system.
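The two operations can be sketched in Python as follows. The function
names are made up for this example. The sketch assumes that the PID is
written to the <code> cgroup.procs </code> file, which both cgroup
versions provide, and that the caller has write access to the mounted
pseudo file system.

```python
import os


def create_cgroup(parent_dir, name):
    """Create a cgroup below parent_dir. Making the directory is enough."""
    path = os.path.join(parent_dir, name)
    os.makedirs(path, exist_ok=True)  # the kernel populates the new directory
    return path


def add_process(cgroup_dir, pid):
    """Move a process into the cgroup by writing its PID to cgroup.procs."""
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as procs:
        procs.write(f"{pid}\n")
```

Since both functions take the directory as an argument, they can be
tried on an ordinary directory first, where they merely create a
regular file.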

We will cover both cgroup versions because as of 2018-11 many
applications still rely on cgroup-v1 and cgroup-v2 still lacks some
of the functionality of cgroup-v1. However, we will not look at
all controllers.
SUBSECTION(«CPU controllers»)

These controllers regulate the distribution of CPU cycles. The <em>
cpuset </em> controller of cgroup-v1 is the oldest cgroup controller.
It was implemented before the cgroup-v1 subsystem existed, which is
why it provides its own pseudo file system which is usually mounted at
<code>/dev/cpuset</code>. This file system is only kept for backwards
compatibility and is otherwise equivalent to the corresponding part of
the cgroup pseudo file system. The cpuset controller links subsets
of CPUs to cgroups so that the processes in a cgroup are confined to
run only on the CPUs of "their" subset.

The CPU controller of cgroup-v2, which is simply called "cpu", works
differently. Instead of specifying the set of admissible CPUs for a
cgroup, one defines the ratio of CPU cycles for the cgroup. Work to
support CPU partitioning as provided by the cpuset controller of
cgroup-v1 is in progress and expected to be ready in 2019.
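The following Python sketch illustrates the two knobs of the cgroup-v2
cpu controller as we understand them from the kernel documentation:
<code> cpu.max</code>, which takes a quota and a period in
microseconds, and <code> cpu.weight</code>, which determines the share
a cgroup receives when its siblings compete for the same CPUs. The
function names are hypothetical and the second function is a simplified
model, not the scheduler's actual algorithm.

```python
def cpu_max_line(cpus, period_us=100000):
    """Format a cpu.max value which grants `cpus` CPUs worth of time."""
    if cpus is None:
        return f"max {period_us}"  # "max" stands for "no limit"
    return f"{int(cpus * period_us)} {period_us}"


def cpu_share(weight, sibling_weights):
    """Fraction of CPU time under full contention, given the cpu.weight values."""
    return weight / (weight + sum(sibling_weights))
```

For instance, <code> cpu_max_line(0.5) </code> yields a string which
limits the cgroup to half a CPU when written to <code> cpu.max</code>.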
SUBSECTION(«Devices»)

The device controller of cgroup-v1 imposes mandatory access control
for device-special files. It tracks the <code> open(2) </code> and
<code> mknod(2) </code> system calls and enforces the restrictions
defined in the <em> device access whitelist </em> of the cgroup the
calling process belongs to.

Processes in the root cgroup have full permissions. Other cgroups
inherit the device permissions from their parent. A child cgroup
never has more permission than its parent.
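A whitelist lookup can be modelled in Python as shown below. The sketch
assumes the entry format of the cgroup-v1 <code> devices.list </code>
file (device type, <code> major:minor </code> with <code> * </code> as
wildcard, and a subset of the access letters <code> r</code>, <code>
w</code> and <code> m </code> for read, write and mknod). The function
names are made up, and the kernel's real matching rules may differ in
details.

```python
def parse_rule(entry):
    """Split an entry like 'c 1:3 rwm' into type, major, minor, access set."""
    dev_type, numbers, access = entry.split()
    major, minor = numbers.split(":")
    return dev_type, major, minor, set(access)


def allowed(whitelist, dev_type, major, minor, access):
    """Check whether one whitelist entry grants all requested access letters."""
    for rule_type, rule_major, rule_minor, rule_access in map(parse_rule, whitelist):
        if rule_type not in ("a", dev_type):   # "a" matches all device types
            continue
        if rule_major not in ("*", str(major)):
            continue
        if rule_minor not in ("*", str(minor)):
            continue
        if set(access) <= rule_access:
            return True
    return False
```

With the entries <code> c 1:3 rwm </code> and <code> c 5:* rw</code>,
reading and writing <code> /dev/null </code> (character device 1:3) is
allowed while access to any block device is denied.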

Cgroup-v2 takes a completely different approach to device access
control. It is implemented on top of BPF, the <em> Berkeley packet
filter</em>. Hence this controller is not listed in the cgroup-v2
pseudo file system.
SUBSECTION(«Freezer»)

Both cgroup-v1 and cgroup-v2 implement a <em>freezer</em> controller,
which provides the ability to stop ("freeze") all processes in a
cgroup to free up resources for other tasks. The stopped processes can
be continued ("thawed") as a unit later. This is similar to sending
<code>SIGSTOP/SIGCONT</code> to all processes, but avoids some problems
with corner cases. The v2 version was added in 2019-07. It is available
from Linux-5.2 onwards.
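The two interfaces differ in the file name and in the values which
have to be written. The Python sketch below captures this difference.
It assumes the <code> freezer.state </code> file of cgroup-v1 and the
<code> cgroup.freeze </code> file of cgroup-v2. The function name is
made up for this example.

```python
import os


def set_frozen(cgroup_dir, frozen, version=2):
    """Freeze or thaw all processes of a cgroup. Returns what was written."""
    if version == 1:
        filename = "freezer.state"
        value = "FROZEN" if frozen else "THAWED"
    else:
        filename = "cgroup.freeze"
        value = "1" if frozen else "0"
    with open(os.path.join(cgroup_dir, filename), "w") as state:
        state.write(value + "\n")
    return filename, value
```

Like any write to the cgroup pseudo file system, this only works if
the caller has the necessary permissions on the cgroup directory.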
SUBSECTION(«Memory»)

Cgroup-v1 offers three controllers related to memory management. First
there is the cpuset controller described above which can be instructed
to let processes allocate only memory which is close to the CPUs
of the cpuset. This makes sense on NUMA (non-uniform memory access)
systems where the memory access time for a given CPU depends on the
memory location. Second, the <em> hugetlb </em> controller manages
distribution and usage of <em> huge pages</em>. Third, there is the
<em> memory resource </em> controller which provides a number of
files in the cgroup pseudo file system to limit process memory usage,
swap usage and the usage of memory by the kernel on behalf of the
process. The most important tunable of the memory resource controller
is <code> memory.limit_in_bytes</code>.

The cgroup-v2 version of the memory controller is rather more complex
because it attempts to limit direct and indirect memory usage of
the processes in a cgroup in a bullet-proof way. It is designed to
restrain even malicious processes which try to slow down or crash
the system by indirectly allocating memory. For example, a process
could try to create many threads or file descriptors which all cause a
(small) memory allocation in the kernel. Besides several tunables and
statistics, the memory controller provides the <code> memory.events
</code> file whose contents change whenever a state transition
for the cgroup occurs, for example when processes start to get
throttled because the high memory boundary was exceeded. This file
could be monitored by a <em> management agent </em> to take appropriate
actions. The main mechanism to control the memory usage is the <code>
memory.high </code> file.
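What a management agent has to do with <code> memory.events </code>
can be sketched in Python. The sketch assumes that the file consists
of lines of the form "name counter", as described in the kernel
documentation for cgroup-v2. The function names are made up for this
example.

```python
def parse_events(text):
    """Turn the 'name counter' lines of memory.events into a dictionary."""
    return {name: int(counter) for name, counter in
            (line.split() for line in text.splitlines() if line.strip())}


def new_events(before, after):
    """Return the counters which grew between two snapshots of the file."""
    return {name: after[name] - before.get(name, 0)
            for name in after if after[name] > before.get(name, 0)}
```

An agent would take a snapshot, wait until the file changes, take a
second snapshot and react to whatever <code> new_events() </code>
reports, for example by raising <code> memory.high </code> or by
stopping a misbehaving process.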
SUBSECTION(«I/O»)

I/O controllers regulate the distribution of I/O resources among
cgroups. The throttling policy of cgroup-v2 can be used to enforce I/O
rate limits on arbitrary block devices, for example on a logical volume
provided by the logical volume manager (LVM). Read and write bandwidth
may be throttled independently. Moreover, the number of IOPS (I/O
operations per second) may also be throttled. The I/O controller of
cgroup-v1 is called <em> blkio </em> while for cgroup-v2 it is simply
called <em> io</em>. The features of the v1 and v2 I/O controllers
are identical but the filenames of the pseudo files and the syntax
for setting I/O limits differ. The exercises ask the reader to try
out both versions.
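The difference in syntax is small, as the following Python sketch
shows. It formats a read bandwidth limit for the <code>
blkio.throttle.read_bps_device </code> file of cgroup-v1 and for the
<code> io.max </code> file of cgroup-v2. The function names are made up
for this example.

```python
import os


def device_numbers(path):
    """Return the (major, minor) numbers of a device-special file."""
    rdev = os.stat(path).st_rdev
    return os.major(rdev), os.minor(rdev)


def v1_read_limit(major, minor, bytes_per_second):
    """Value to write to blkio.throttle.read_bps_device (cgroup-v1)."""
    return f"{major}:{minor} {bytes_per_second}"


def v2_read_limit(major, minor, bytes_per_second):
    """Value to write to io.max (cgroup-v2). rbps limits read bandwidth."""
    return f"{major}:{minor} rbps={bytes_per_second}"
```

In both cases the device is identified by its major and minor number
rather than by its path.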

There is no cgroup-v2 controller for multi-queue schedulers so far.
However, there is the <em> I/O Latency </em> controller for cgroup-v2
which works for arbitrary block devices and all I/O schedulers. It
features <em> I/O workload protection </em> for the processes in
a cgroup. This works by throttling the processes in cgroups that
have a lower latency target than those in the protected cgroup. The
throttling is performed by lowering the depth of the request queue
of the affected devices.
<ul>
<li> Run <code> mount -t cgroup none /var/cgroup </code> and <code>
mount -t cgroup2 none /var/cgroup2 </code> to mount both cgroup pseudo
file systems and explore the files they provide. </li>

<li> Learn how to put the current shell into a new cgroup.
Hints: For v1, start with <code> echo 0 > cpuset.mems && echo 0 >
cpuset.cpus</code>. For v2: First activate controllers for the cgroup
in the parent directory. </li>

<li> Set up the cpuset controller so that your shell process has only
access to a single CPU core. Test that the limitation is enforced by
running <code>stress -c 2</code>. </li>

<li> Repeat the above for the cgroup-v2 CPU controller. Hint: <code>
echo 1000000 1000000 > cpu.max</code>. </li>

<li> In a cgroup with one bash process, start a simple loop that prints
some output: <code> while :; do date; sleep 1; done</code>. Freeze
the cgroup by writing the string <code> FROZEN </code>
to a suitable <code> freezer.state </code> file in the cgroup-v1 file
system. Then unfreeze the cgroup by writing <code> THAWED </code>
to the same file. Find out how one can tell whether a given cgroup
is frozen. </li>

<li> Pick a block device to throttle. Estimate its maximal read
bandwidth by running a command like <code> ddrescue /dev/sdX
/dev/null</code>. Enforce a read bandwidth rate of 1M/s for the
device by writing a string of the form <code> "$MAJOR:$MINOR $((1024 *
1024))" </code> to a file named <code> blkio.throttle.read_bps_device
</code> in the cgroup-v1 pseudo file system. Check that the bandwidth
was indeed throttled by running the above <code> ddrescue </code>
command again. </li>

<li> Repeat the previous exercise, but this time use the cgroup-v2
interface for the I/O controller. Hint: write a string of the form
<code> "$MAJOR:$MINOR rbps=$((1024 * 1024))" </code> to a file named
<code>io.max</code>. </li>

</ul>
<ul>

<li> In one terminal running <code> bash</code>, start a second <code>
bash </code> process and print its PID with <code> echo $$</code>.
Guess what happens if you run <code> kill -STOP $PID; kill -CONT
$PID</code> from a second terminal, where <code> $PID </code>
is the PID that was printed in the first terminal. Try it out,
explain the observed behaviour and discuss its impact on the freezer
controller. Repeat the experiment but this time use the freezer
controller to stop and restart the bash process. </li>
</ul>

»)
618 SECTION(«Linux Containers»)
<p> Containers provide resource management through control groups and
resource isolation through namespaces. A <em> container platform </em>
is thus a software layer implemented on top of these features. Given a
directory containing a Linux root file system, starting the container
is a simple matter: First <code> clone(2) </code> is called with the
proper <code> CLONE_NEW* </code> flags to create a new process in a
suitable set of namespaces. The child process then creates a cgroup
for the container and puts itself into it. The final step is to let
the child process hand over control to the container's <code>
/sbin/init </code> by calling <code> execve(2)</code>. When the last
process in the newly created namespaces exits, the namespaces disappear
and the parent process removes the cgroup. The details are a bit more
complicated, but the above covers the essence of what the container
startup command has to do. </p>
635 <p> Many container platforms offer additional features not to
636 be discussed here, like downloading and unpacking a file system
637 image from the internet, or supplying the root file system
638 for the container by other means, for example by creating an
639 LVM snapshot of a master image. In this section we look at <a
640 href="http://people.tuebingen.mpg.de/maan/micoforia/">micoforia</a>,
641 a minimalistic container platform to boot a container from an existing
642 root file system as described above. </p>
644 <p> The containers known to micoforia are defined in the single
645 <code>~/.micoforiarc</code> configuration file whose format is
646 described in <code>micoforia(8)</code>. The <code>micoforia</code>
647 command supports various subcommands to maintain containers. For
648 example, containers are started with a command such as <code>micoforia
649 start c1</code> where <code>c1</code> is the name of the
650 container. One can execute a shell running within the container
651 with <code>micoforia enter c1</code>, log in to a local pseudo
652 terminal with <code>micoforia attach c1</code>, or connect via ssh
with <code>ssh c1</code>. Of course, the latter command only works
if the network interface and the DNS record get configured during
container startup and an ssh server is installed. The container can
656 be stopped by executing <code>halt</code> from within the container,
657 or by running <code>micoforia stop c1</code> on the host system. The
658 commands <code>micoforia ls</code> and <code>micoforia ps</code>
659 print information about containers and their processes. </p>
661 <p> The exercises ask the reader to install the micoforia package from
662 source, and to set up a minimal container running Ubuntu Linux. </p>
666 <ul>
667 <li> Clone the micoforia git repository from
668 <code>git://git.tuebingen.mpg.de/micoforia</code> and compile the
669 source code with <code>./configure && make</code>. Install with
670 <code>make install</code>. </li>
672 <li> Download a minimal Ubuntu root file system
673 with a command like <code>debootstrap --download-only
674 --include isc-dhcp-client bionic /var/lib/micoforia/c1/
675 http://de.archive.ubuntu.com/ubuntu</code>. </li>
677 <li> Set up an ethernet bridge as described
678 in <code>micoforia(8)</code>. Consult the <a
679 href="./Networking.html#link_layer">Link Layer</a> section of the
680 chapter on networking if you would like to understand what you are
681 doing. </li>
<li> Add the following lines to <code>/etc/fstab</code> to configure
the cgroup file systems, create the corresponding mount point
directories, and run <code>mount -a</code> to mount them.

<pre>
none /var/cgroup cgroup devices 0 0
none /var/cgroup2 cgroup2 defaults 0 0
</pre>
</li>
692 <li> Define a minimal container named <code>c1</code> with <code>echo
693 container c1 > ~/.micoforiarc</code>. This container will have a
694 single network device and neither CPU nor memory isolation will be
695 enforced for the processes of the container. </li>
697 <li> Start the container in foreground mode with <code>micoforia --
698 start -F c1</code>. </li>
700 <li> Run <code>micoforia stop c1</code> to stop the container,
701 edit <code>~/.micoforiarc</code> and add the following two lines to
702 configure memory and CPU limits.
704 <pre>
705 cpu-cores c1:1
706 memory-limit c1:1
707 </pre>
709 Start the container and run suitable commands which show that the
710 newly configured limits are in effect. </li>
712 <li> While the container is running, investigate the control files
713 of the cgroup pseudo file systems. Identify the pseudo files which
714 control the CPU and memory limit of the container. </li>
716 <li> Come up with suitable commands to change the CPU and memory
717 limits of the container while it is running. </li>
719 <li> On the host system, create a loop device and a file system on
720 it. Mount the file system on a subdirectory of the root file system
721 of the container. Note that the mount is not visible from within the
722 container. Come up with a way to make such mounts visible without
723 restarting the container. </li>
724 </ul>
728 SUBSECTION(«UTS Namespace Example»)
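<pre>
<code>
#define _GNU_SOURCE
#include &lt;sys/utsname.h&gt;
#include &lt;sys/wait.h&gt;
#include &lt;sched.h&gt;
#include &lt;signal.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;unistd.h&gt;

static void print_hostname_and_exit(const char *pfx)
{
	struct utsname uts;

	if (uname(&uts) &lt; 0) {
		perror("uname");
		exit(EXIT_FAILURE);
	}
	printf("%s: %s\n", pfx, uts.nodename);
	exit(EXIT_SUCCESS);
}

/* Runs in the new UTS namespace: the host keeps its hostname. */
static int child(void *arg)
{
	if (sethostname("jesus", 5) &lt; 0)
		perror("sethostname");
	print_hostname_and_exit("child");
	return EXIT_SUCCESS; /* not reached */
}

#define STACK_SIZE (64 * 1024)
static char child_stack[STACK_SIZE];

int main(int argc, char *argv[])
{
	/* The stack grows downwards, so pass the end of the buffer. */
	pid_t pid = clone(child, child_stack + STACK_SIZE,
		CLONE_NEWUTS | SIGCHLD, NULL);

	if (pid &lt; 0) {
		perror("clone"); /* CLONE_NEWUTS requires CAP_SYS_ADMIN */
		exit(EXIT_FAILURE);
	}
	waitpid(pid, NULL, 0); /* let the child print first */
	print_hostname_and_exit("parent");
}
</code>
</pre>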
764 SUBSECTION(«PID Namespace Example»)
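<pre>
<code>
#define _GNU_SOURCE
#include &lt;sys/wait.h&gt;
#include &lt;sched.h&gt;
#include &lt;signal.h&gt;
#include &lt;unistd.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;stdio.h&gt;

/* Runs as the first process of the new PID namespace. */
static int child(void *arg)
{
	printf("PID: %d, PPID: %d\n", (int)getpid(), (int)getppid());
	return EXIT_SUCCESS;
}

#define STACK_SIZE (64 * 1024)
static char child_stack[STACK_SIZE];

int main(int argc, char *argv[])
{
	/* The stack grows downwards, so pass the end of the buffer. */
	pid_t pid = clone(child, child_stack + STACK_SIZE,
		CLONE_NEWPID | SIGCHLD, NULL);

	if (pid &lt; 0) {
		perror("clone"); /* CLONE_NEWPID requires CAP_SYS_ADMIN */
		exit(EXIT_FAILURE);
	}
	/* The PID of the child as seen from the parent's namespace. */
	printf("child PID: %d\n", (int)pid);
	waitpid(pid, NULL, 0);
	exit(EXIT_SUCCESS);
}
</code>
</pre>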
790 SECTION(«Further Reading»)
791 <ul>
<li> <a href="https://lwn.net/Articles/782876/">The creation of the
io.latency block I/O controller</a>, by Josef Bacik. </li>
794 </ul>