OS-Level_Virtualization.m4

   1 TITLE(«
   2 Fools ignore complexity. Pragmatists suffer it. Some can avoid it.
   3 Geniuses remove it. -- Perlis's Programming Proverb #58 (1982)
   4 », __file__)
   5
   6
   7 OVERVIEW(«
   8
   9 In general, virtualization refers to the abstraction of computer
  10 resources. This chapter is primarily concerned with <em> server
  11 virtualization</em>, a concept which makes it possible to run
  12 more than one operating system simultaneously and independently
  13 of each other on a single physical computer.  We first describe
  14 the different virtualization frameworks but quickly focus on
  15 Linux OS-level virtualization and its virtual machines, called <em>
  16 containers</em>. Container platforms for Linux are built on top of
  17 <em>namespaces</em> and <em>control groups</em>, the low-level kernel
  18 features which implement abstraction and isolation of processes. We
  19 look at both concepts in some detail.  One of the earliest container
  20 platforms for Linux is <em> LXC </em> (Linux containers) which is
  21 discussed in a dedicated section.
  22
  23 »)
  24
  25 SECTION(«Virtualization Frameworks»)
  26
  27 The origins of server virtualization date back to the 1960s. The
  28 first virtual machine was created as a collaboration between IBM
  29 (International Business Machines) and the MIT (Massachusetts Institute
  30 of Technology). Since then, many different approaches have been
  31 designed, resulting in several <em> Virtualization Frameworks</em>. All
  32 frameworks promise to improve resource utilization and availability, to
  33 reduce costs, and to provide greater flexibility. While some of these
  34 benefits might be real, they do not come for free. The costs include
  35 a single point of failure (the host), decreased performance,
  36 added complexity, and increased maintenance effort due to extensive
  37 debugging, documentation, and upkeep of the VMs. This chapter
  38 briefly describes the three main virtualization frameworks. We list
  39 the advantages and disadvantages of each and give some examples.
  40
  41 SUBSECTION(«Software Virtualization (Emulation)»)
  42
  43 This virtualization framework does not play a significant role in
  44 server virtualization; it is only included for completeness. Emulation
  45 means to imitate a complete hardware architecture in software,
  46 including peripheral devices. All CPU instructions and hardware
  47 interrupts are interpreted by the emulator rather than being run by
  48 native hardware. Since this approach has a large performance penalty,
  49 it is only suitable when speed is not critical. For this reason,
  50 emulation is typically employed for ancient hardware like arcade
  51 game systems and home computers such as the Commodore 64. Despite
  52 the performance penalty, emulation is valuable because it allows
  53 applications and operating systems to run on the current platform as
  54 they did in their original environment.
  55
  56 Examples: Bochs, MAME, VICE.
  57
  58 SUBSECTION(«Paravirtualization and Hardware-Assisted Virtualization»)
  59
  60 These virtualization frameworks are characterized by the presence
  61 of a <em> hypervisor</em>, also known as <em> Virtual Machine
  62 Monitor</em>, which translates system calls from the VMs to native
  63 hardware requests. In contrast to Software Virtualization, the
  64 host OS does not emulate hardware resources but offers a special
  65 API to the VMs. If the presented interface is different from that
  66 of the underlying hardware, the term <em> paravirtualization </em>
  67 is used. The guest OS then has to be modified to include modified
  68 (paravirtualized) drivers. In 2005 and 2006, Intel and AMD added hardware
  69 virtualization instructions to the CPUs and IOMMUs (Input/Output memory
  70 management units) to the chipsets. This allowed VMs to directly execute
  71 privileged instructions and use peripheral devices. This so-called <em>
  72 Hardware-Assisted Virtualization </em> allows unmodified operating
  73 systems to run on the VMs.
  74
  75 The main advantage of Hardware-Assisted Virtualization is its
  76 flexibility, as the host OS does not need to match the OS running on
  77 the VMs. The disadvantages are hardware compatibility constraints and
  78 performance loss. Although these days all hardware has virtualization
  79 support, there are still significant differences in performance between
  80 the host and the VM. Moreover, peripheral devices like storage hardware
  81 have to be compatible with the chipset to make use of the IOMMU.
  82
  83 Examples: KVM (with QEMU as hypervisor), Xen, UML
  84
  85 SUBSECTION(«OS-level Virtualization (Containers)»)
  86
  87 OS-level Virtualization is a technique for lightweight virtualization.
  88 The abstractions are built directly into the kernel and no
  89 hypervisor is needed. In this context the term "virtual machine" is
  90 inaccurate, which is why OS-level VMs go by different names.
  91 On Linux, they are called <em> containers</em>; other
  92 operating systems call them <em> jails </em> or <em> zones</em>. We
  93 shall exclusively use "container" from now on. All containers share
  94 a single kernel, so the OS running in the container has to match the
  95 host OS. However, each container has its own root file system, so
  96 containers can differ in user space. For example, different containers
  97 can run different Linux distributions. Since programs running in a
  98 container use the normal system call interface to communicate with
  99 the kernel, OS-level Virtualization does not require hardware support
 100 for efficient performance. In fact, OS-level Virtualization imposes
 101 almost no overhead.
 102
 103 OS-level Virtualization is superior to the alternatives because of its
 104 simplicity and its performance. The only disadvantage is the lack of
 105 flexibility. It is simply not an option if some of the VMs must run
 106 different operating systems than the host.
 107
 108 Examples: LXC, Singularity, Docker.
 109
 110 EXERCISES()
 111
 112 <ul>
 113
 114         <li> On any Linux system, check if the processor supports virtualization
 115         by running <code> cat /proc/cpuinfo</code>. Hint: svm and vmx. </li>
 116
 117         <li> Hypervisors come in two flavors called <em> native </em> and <em>
 118         hosted</em>. Explain the difference and the pros and cons of either
 119         flavor. Is QEMU a native or a hosted hypervisor? </li>
 120
 121         <li> Scan through chapter 15 (Secure Virtual Machine) of the
 122
 123                 <a href="https://www.amd.com/system/files/TechDocs/24593.pdf">AMD Programmer's Manual</a>
 124
 125         to get an idea of the complexity of Hardware-Assisted
 126         Virtualization. </li>
 127
 128 </ul>
 129
 130 HOMEWORK(«
 131
 132 <ul>
 133         <li> Recall the concept of <em> direct memory access </em> (DMA)
 134         and explain why DMA is a problem for virtualization. Which of the
 135         three virtualization frameworks of this chapter are affected by this
 136         problem? </li>
 137
 138         <li> Compare AMD's <em> Rapid Virtualization Indexing </em> to Intel's
 139         <em> Extended Page Tables</em>. </li>
 140
 141         <li> Suppose a hacker gained root access to a VM and wishes to proceed
 142         from there to get also full control over the host OS. Discuss the threat
 143         model in the context of the three virtualization frameworks covered
 144         in this section. </li>
 145
 146 </ul>
 147 »)
 148
 149 SECTION(«Namespaces»)
 150
 151 Namespaces partition the set of processes into disjoint subsets
 152 with local scope. Where the traditional Unix systems provided only
 153 a single system-wide resource shared by all processes, the namespace
 154 abstractions make it possible to give processes the illusion of living
 155 in their own isolated instance.  Linux implements the following
 156 six different types of namespaces: mount (Linux-2.4.x, 2002), IPC
 157 (Linux-2.6.19, 2006), UTS (Linux-2.6.19, 2006), PID (Linux-2.6.24,
 158 2008), network (Linux-2.6.29, 2009), user (Linux-3.8, 2013).
 159 For OS-level virtualization all six namespace types are typically
 160 employed to make the containers look like independent systems.
 161
 162 Before we look at each namespace type, we briefly describe how
 163 namespaces are created and how information related to namespaces can
 164 be obtained for a process.
 165
 166 SUBSECTION(«Namespace API»)
 167
 168 <p> Initially, there is only a single namespace of each type called the
 169 <em> root namespace</em>. All processes belong to this namespace. The
 170 <code> clone(2) </code> system call is a generalization of the classic
 171 <code> fork(2) </code> which allows privileged users to create new
 172 namespaces by passing one or more of the six <code> CLONE_NEW* </code>
 173 flags. The child process is made a member of the new namespace. Calling
 174 plain <code> fork(2) </code> or <code> clone(2) </code> with no
 175 <code> CLONE_NEW* </code> flag lets the newly created process inherit the
 176 namespaces from its parent. There are two additional system calls,
 177 <code> setns(2) </code> and <code> unshare(2) </code> which both
 178 change the namespace(s) of the calling process without creating a
 179 new process. For the latter, there is a user command, also called
 180 <code> unshare(1) </code> which makes the namespace API available to
 181 scripts. </p>
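
<p> The flag handling described above can be sketched in Python
through <code>ctypes</code>. This is only an illustration: the constants
are the <code>CLONE_NEW*</code> values from the kernel headers, the helper
names <code>ns_flags</code> and <code>unshare</code> are made up for this
example, and the actual call normally requires privileges. </p>

```python
import ctypes
import os

# CLONE_NEW* values as defined in <linux/sched.h>.
CLONE_FLAGS = {
    "mount": 0x00020000,  # CLONE_NEWNS
    "uts":   0x04000000,  # CLONE_NEWUTS
    "ipc":   0x08000000,  # CLONE_NEWIPC
    "user":  0x10000000,  # CLONE_NEWUSER
    "pid":   0x20000000,  # CLONE_NEWPID
    "net":   0x40000000,  # CLONE_NEWNET
}

def ns_flags(types):
    """Combine namespace type names into one flag word (hypothetical helper)."""
    flags = 0
    for t in types:
        flags |= CLONE_FLAGS[t]
    return flags

def unshare(types):
    """Call unshare(2) with the flags for the given namespace types."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(ns_flags(types)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

<p> For example, <code>unshare(["uts"])</code> would give the calling
process a private copy of the host and domain name, provided the caller
is allowed to create namespaces. </p>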
 182
 183 <p> The <code> /proc/$PID </code> directory of each process contains a
 184 <code> ns </code> subdirectory which contains one file per namespace
 185 type. The inode number of this file is the <em> namespace ID</em>.
 186 Hence, by running <code> stat(1) </code> one can tell whether
 187 two different processes belong to the same namespace. Normally a
 188 namespace ceases to exist when the last process in the namespace
 189 terminates. However, by opening <code> /proc/$PID/ns/$TYPE </code>
 190 one can prevent the namespace from disappearing. </p>
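
<p> A minimal sketch of how namespace IDs could be compared from a
script. The function names are hypothetical. The link targets below
<code>/proc/$PID/ns</code> have the form <code>type:[inode]</code>, and
<code>os.stat()</code> follows the link, so it reports the same inode
number. </p>

```python
import os
import re

def parse_ns_link(target):
    """Split a link target such as 'uts:[4026531838]' into (type, id)."""
    m = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if m is None:
        raise ValueError("unexpected namespace link: " + target)
    return m.group(1), int(m.group(2))

def namespace_id(pid, ns_type):
    """Return the namespace ID, i.e. the inode number behind the link."""
    return os.stat("/proc/%s/ns/%s" % (pid, ns_type)).st_ino

def same_namespace(pid_a, pid_b, ns_type):
    """Tell whether two processes belong to the same namespace."""
    return namespace_id(pid_a, ns_type) == namespace_id(pid_b, ns_type)
```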
 191
 192 SUBSECTION(«UTS Namespaces»)
 193
 194 UTS is short for <em> UNIX Time-sharing System</em>. The old-fashioned
 195 word "Time-sharing" has been replaced by <em> multitasking</em>
 196 but the old name lives on in the <code> uname(2) </code> system
 197 call which fills out the fields of a <code> struct utsname</code>.
 198 On return the <code> nodename </code> field of this structure
 199 contains the hostname which was set by a previous call to <code>
 200 sethostname(2)</code>. Similarly, the <code> domainname </code> field
 201 contains the string that was set with <code> setdomainname(2)</code>.
 202
 203 UTS namespaces provide isolation of these two system identifiers. That
 204 is, processes in different UTS namespaces might see different host and
 205 domain names. Changing the host or domain name affects only processes
 206 which belong to the same UTS namespace as the process which called
 207 <code> sethostname(2) </code> or <code> setdomainname(2)</code>.
 208
 209 SUBSECTION(«Mount Namespaces»)
 210
 211 The <em> mount namespaces </em> are the oldest Linux namespace
 212 type. This is kind of natural since they are supposed to overcome
 213 well-known limitations of the venerable <code> chroot(2) </code>
 214 system call which was introduced in 1979. Mount namespaces isolate
 215 the mount points seen by processes so that processes in different
 216 mount namespaces can have different views of the file system hierarchy.
 217
 218 Like for other namespace types, new mount namespaces are created by
 219 calling <code> clone(2) </code> or <code> unshare(2)</code>. The
 220 new mount namespace starts out with a copy of the caller's mount
 221 point list.  However, with more than one mount namespace the <code>
 222 mount(2) </code> and <code> umount(2) </code> system calls no longer
 223 operate on a global set of mount points. Whether or not a mount
 224 or unmount operation has an effect on processes in different mount
 225 namespaces than the caller's is determined by the configurable <em>
 226 mount propagation </em> rules. By default, modifications to the list
 227 of mount points affect only the processes which are in the same
 228 mount namespace as the process which initiated the modification. This
 229 setting is controlled by the <em> propagation type </em> of the
 230 mount point. Besides the obvious private and shared types, there is
 231 also the <code> MS_SLAVE </code> propagation type which lets mount
 232 and unmount events propagate from a "master" to its "slaves"
 233 but not the other way round.
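
The propagation type of each mount point can be read from the optional
fields of <code>/proc/$PID/mountinfo</code>. The following sketch
(with a hypothetical function name) classifies one line of that file,
assuming the field layout documented in proc(5): the optional fields
start at the seventh column and end at a lone hyphen.

```python
def propagation(mountinfo_line):
    """Return the propagation tags of one /proc/PID/mountinfo line."""
    fields = mountinfo_line.split()
    sep = fields.index("-")       # separator that ends the optional fields
    optional = fields[6:sep]      # e.g. ['shared:1', 'master:2']
    tags = []
    for field in optional:
        name = field.split(":")[0]
        if name == "shared":
            tags.append("shared")
        elif name == "master":
            tags.append("slave")  # receives events from its master
        elif name == "unbindable":
            tags.append("unbindable")
    return tags or ["private"]    # no optional field: private mount
```

A mount can be shared and slave at the same time, which is why the
function returns a list rather than a single value.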
 234
 235 SUBSECTION(«Network Namespaces»)
 236
 237 Network namespaces not only partition the set of processes, as all
 238 six namespace types do, but also the set of network interfaces. That
 239 is, each physical or virtual network interface belongs to one (and
 240 only one) network namespace. Initially, all interfaces are in the
 241 root network namespace. This can be changed with the command <code>
 242 ip link set iface netns PID</code>. Processes only see interfaces
 243 whose network namespace matches the one they belong to. This lets
 244 processes in different network namespaces have different ideas about
 245 which network devices exist. Each network namespace has its own IP
 246 stack, IP routing table and TCP and UDP ports. This makes it possible
 247 to start, for example, many <code> sshd(8) </code> processes which
 248 all listen on "their own" TCP port 22.
 249
 250 An OS-level virtualization framework typically leaves physical
 251 interfaces in the root network namespace but creates a dedicated
 252 network namespace and a virtual interface pair for each container. One
 253 end of the pair is left in the root namespace while the other end is
 254 configured to belong to the dedicated namespace, which contains all
 255 processes of the container.
 256
 257 SUBSECTION(«PID Namespaces»)
 258
 259 This namespace type allows a process to have more than one process
 260 ID. Unlike network interfaces which disappear when they enter a
 261 different network namespace, a process is still visible in the root
 262 namespace after it has entered a different PID namespace. Besides its
 263 existing PID it gets a second PID which is only valid inside the target
 264 namespace. Similarly, when a new PID namespace is created by passing
 265 the <code> CLONE_NEWPID </code> flag to <code> clone(2)</code>, the
 266 child process gets some unused PID in the original PID namespace
 267 but PID 1 in the new namespace.
 268
 269 As a consequence, processes in different PID namespaces can have the
 270 same PID. In particular, there can be arbitrarily many "init" processes,
 271 which all have PID 1. The usual rules for PID 1 apply within each PID
 272 namespace. That is, orphaned processes are reparented to the init
 273 process, and it is a fatal error if the init process terminates,
 274 causing all processes in the namespace to terminate as well. PID
 275 namespaces can be nested, but under normal circumstances they are
 276 not. So we won't discuss nesting.
 277
 278 Since each process in a non-root PID namespace also has a PID in the
 279 root PID namespace, processes in the root PID namespace can "see" all
 280 processes but not vice versa. Hence a process in the root namespace can
 281 send signals to all processes while processes in the child namespace
 282 can only send signals to processes in their own namespace.
 283
 284 Processes can be moved from the root PID namespace into a child
 285 PID namespace but not the other way round. Moreover, a process can
 286 instruct the kernel to create subsequent child processes in a different
 287 PID namespace.
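
To see the different PIDs of one process, one can look at the
<code>NSpid</code> line of <code>/proc/$PID/status</code>, which is
available on Linux-4.1 and later. The sketch below uses a hypothetical
function name and only parses text which has already been read from
that file.

```python
def ns_pids(status_text):
    """Return the PIDs listed in the NSpid line of /proc/PID/status.

    The first entry is the PID as seen by the reader, followed by the
    PIDs in successively nested PID namespaces.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(p) for p in line.split()[1:]]
    return []  # older kernels have no NSpid line
```

For the init process of a container, read from the root PID namespace,
the last entry of the returned list would be 1.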
 288
 289 SUBSECTION(«User Namespaces»)
 290
 291 User namespaces have been implemented rather late compared to other
 292 namespace types. The implementation was completed in 2013. The purpose
 293 of user namespaces is to isolate user and group IDs. Initially there
 294 is only one user namespace, the <em> initial namespace </em> to which
 295 all processes belong. As with all namespace types, a new user namespace
 296 is created with <code> unshare(2) </code> or <code> clone(2)</code>.
 297
 298 The UID and GID of a process can be different in different
 299 namespaces. In particular, an unprivileged process may have UID
 300 0 inside a user namespace. When a process is created in a new
 301 namespace or a process joins an existing user namespace, it gains full
 302 privileges in this namespace. However, the process has no additional
 303 privileges in the parent/previous namespace. Moreover, a certain flag
 304 is set for the process which prevents the process from entering yet
 305 another namespace with elevated privileges. In particular it does not
 306 keep its privileges when it returns to its original namespace. User
 307 namespaces can be nested, but we don't discuss nesting here.
 308
 309 Each user namespace has an <em> owner</em>, which is the effective user
 310 ID (EUID) of the process which created the namespace. Any process
 311 in the root user namespace whose EUID matches the owner ID has all
 312 capabilities in the child namespace.
 313
 314 If <code> CLONE_NEWUSER </code> is specified together with other
 315 <code> CLONE_NEW* </code> flags in a single <code> clone(2) </code>
 316 or <code> unshare(2) </code> call, the user namespace is guaranteed
 317 to be created first, giving the child/caller privileges over the
 318 remaining namespaces created by the call.
 319
 320 It is possible to map UIDs and GIDs between namespaces.  The <code>
 321 /proc/$PID/uid_map </code> and <code> /proc/$PID/gid_map </code> files
 322 are used to get and set the mappings. We will only talk about UID
 323 mappings in the sequel because the mechanism for the GID mappings is
 324 analogous. When the <code> /proc/$PID/uid_map </code> (pseudo-)file is
 325 read, the contents are computed on the fly and depend on both the user
 326 namespace to which process <code> $PID </code> belongs and the user
 327 namespace of the calling process. Each line contains three numbers
 328 which specify the mapping for a range of UIDs. The numbers have
 329 to be interpreted in one of two ways, depending on whether the two
 330 processes belong to the same user namespace or not. All system calls
 331 which deal with UIDs transparently translate UIDs by consulting these
 332 maps. A map for a newly created namespace is established by writing
 333 UID-triples <em> once </em> to <em> one </em> <code> uid_map </code>
 334 file. Subsequent writes will fail.
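
As an illustration of the triples, the following sketch translates a
UID by hand. It assumes the common case where the map is read from the
parent namespace, so that each line has the form "first UID inside the
namespace, first UID outside, length of the range". The function name
is hypothetical.

```python
def map_uid(uid, uid_map_text):
    """Translate a UID inside the namespace to the UID outside of it."""
    for line in uid_map_text.splitlines():
        if not line.strip():
            continue
        inside, outside, count = (int(x) for x in line.split())
        if inside <= uid < inside + count:
            return outside + (uid - inside)
    return None  # the UID is not mapped
```

With the single line <code>0 1000 1</code>, UID 0 inside the namespace
corresponds to UID 1000 outside, and every other UID is unmapped.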
 335
 336 SUBSECTION(«IPC Namespaces»)
 337
 338 System V interprocess communication (IPC) subsumes three different
 339 mechanisms which enable unrelated processes to communicate with each
 340 other. These mechanisms, known as <em> message queues</em>, <em>
 341 semaphores </em> and <em> shared memory</em>, predate Linux by at
 342 least a decade. They are mandated by the POSIX standard, so every Unix
 343 system has to implement the prescribed API. The common characteristic
 344 of the System V IPC mechanisms is that their objects are addressed
 345 by system-wide IPC <em> identifiers</em> rather than by pathnames.
 346
 347 IPC namespaces isolate these resources so that processes in different
 348 IPC namespaces have different views of the existing IPC identifiers.
 349 When a new IPC namespace is created, it starts out with all three
 350 identifier sets empty. Newly created IPC objects are only visible
 351 for processes which belong to the same IPC namespace as the process
 352 which created the object.
 353
 354 EXERCISES()
 355
 356 <ul>
 357
 358         <li> Examine <code> /proc/$$/mounts</code>,
 359         <code>/proc/$$/mountinfo</code>, and <code>/proc/$$/mountstats</code>.
 360         </li>
 361
 362         <li> Recall the concept of a <em> bind mount</em>. Describe the
 363         sequence of mount operations a container implementation would need
 364         to perform in order to set up a container whose root file system
 365         is mounted on, say, <code> /mnt </code> before the container is
 366         started. </li>
 367
 368         <li> What should happen on the attempt to change a read-only mount
 369         to be read-write from inside of a container? </li>
 370         <li> Compile and run <code> <a
 371         href="#uts_namespace_example">utc-ns.c</a></code>, a  minimal C
 372         program which illustrates how to create a new UTS namespace. Explain
 373         each line of the source code. </li>
 374
 375         <li> Run <code> ls -l /proc/$$/ns </code> to see the namespaces of
 376         the shell.  Run <code> stat -L /proc/$$/ns/uts </code> and confirm
 377         that the inode number coincides with the number shown in the target
 378         of the link of the <code> ls </code> output. </li>
 379
 380         <li> Discuss why creating a namespace is a privileged operation. </li>
 381
 382         <li> What is the parent process ID of the init process? Examine the
 383         fourth field of <code> /proc/1/stat </code> to confirm. </li>
 384
 385         <li> It is possible for a process in a PID namespace to have a parent
 386         which is outside of this namespace. This is certainly the case for
 387         the process with PID 1. Can this also happen for a different process?
 388         </li>
 389
 390         <li> Examine the <code> <a
 391         href="#pid_namespace_example">pid-ns.c</a></code> program. Will the
 392         two numbers printed as <code> PID </code> and <code> child PID </code>
 393         be the same? What will be the PPID number? Compile and run the program
 394         to see if your guess was correct. </li>
 395
 396         <li> Create a veth device pair. Check that both ends of the pair are
 397         visible with <code> ip link show</code>. Start a second shell in a
 398         different network namespace and confirm by running the same command
 399         that no network interfaces exist in this namespace. In the original
 400         namespace, set the namespace of one end of the pair to the process ID
 401         of the second shell and confirm that the interface "moved" from one
 402         namespace to the other. Configure (different) IP addresses on both ends
 403         of the pair and transfer data through the ethernet tunnel between the
 404         two shell processes which reside in different network namespaces. </li>
 405
 406         <li> Loopback, bridge, ppp and wireless are <em> network namespace
 407         local devices</em>, meaning that the namespace of such devices can
 408         not be changed. Explain why. Run <code> ethtool -k iface </code>
 409         to find out which devices are network namespace local. </li>
 410
 411         <li> In a user namespace where the <code> uid_map </code> file has
 412         not been written, system calls like <code> setuid(2) </code> which
 413         change process UIDs fail. Why? </li>
 414
 415         <li> What should happen if a set-user-ID program is executed inside
 416         of a user namespace and the on-disk UID of the program is not a mapped
 417         UID? </li>
 418
 419         <li> Is it possible for a UID to map to different user names even if
 420         no user namespaces are in use? </li>
 421
 422 </ul>
 423
 424 HOMEWORK(«
 425 The <code> shmctl(2) </code> system call performs operations on a System V
 426 shared memory segment.  It operates on a <code> shmid_ds </code> structure
 427 which contains in the <code> shm_lpid </code> field the PID of the process
 428 which last attached or detached the segment. Describe the implications this API
 429 detail has on the interaction between IPC and PID namespaces.
 430 »)
 431
 432 SECTION(«Control Groups»)
 433
 434 <em> Control groups </em> (cgroups) allow processes to be grouped
 435 and organized hierarchically in a tree. Each control group contains
 436 processes which can be monitored or controlled as a unit, for example
 437 by limiting the resources they can occupy. Several <em> controllers
 438 </em> exist (CPU, memory, I/O, etc.), some of which actually impose
 439 control while others only provide identification and relay control
 440 to separate mechanisms. Unfortunately, control groups are not easy to
 441 understand because the controllers are implemented in an inconsistent
 442 way and because of the rather chaotic relationship between them.
 443
 444 In 2014 it was decided to rework the cgroup subsystem of the Linux
 445 kernel. To keep existing applications working, the original cgroup
 446 implementation, now called <em> cgroup-v1</em>, was retained and a
 447 second, incompatible, cgroup implementation was designed. Cgroup-v2
 448 aims to address the shortcomings of the first version, including its
 449 inefficiency, inconsistency and the lack of interoperability among
 450 controllers. The cgroup-v2 API was made official in 2016. Version 1
 451 continues to work even if both implementations are active.
 452
 453 Both cgroup implementations provide a pseudo file system that
 454 must be mounted in order to define and configure cgroups. The two
 455 pseudo file systems may be mounted at the same time (on different
 456 mountpoints). For both cgroup versions, the standard <code> mkdir(2)
 457 </code> system call creates a new cgroup. To add a process to a cgroup
 458 one must write its PID to one of the files in the pseudo file system.
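
A minimal sketch of these two steps. It assumes that the file to write
to is <code>cgroup.procs</code>, which exists in both cgroup versions.
The function name and its parameters are hypothetical, and on a real
system the mount point must be a mounted cgroup file system and the
caller needs write permission.

```python
import os

def add_to_cgroup(mountpoint, name, pid):
    """Create cgroup `name` below `mountpoint` and move `pid` into it."""
    path = os.path.join(mountpoint, name)
    os.makedirs(path, exist_ok=True)   # mkdir(2) creates the cgroup
    with open(os.path.join(path, "cgroup.procs"), "w") as f:
        f.write(str(pid))              # writing a PID migrates the process
    return path
```

For example, <code>add_to_cgroup("/var/cgroup2", "demo", os.getpid())</code>
would put the running interpreter into a cgroup called "demo".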
 459
 460 We will cover both cgroup versions because as of 2018-11 many
 461 applications still rely on cgroup-v1 and cgroup-v2 still lacks some
 462 of the functionality of cgroup-v1. However, we will not look at
 463 all controllers.
 464
 465 SUBSECTION(«CPU controllers»)
 466
 467 These controllers regulate the distribution of CPU cycles. The <em>
 468 cpuset </em> controller of cgroup-v1 is the oldest cgroup controller.
 469 It was implemented before the cgroup-v1 subsystem existed, which is
 470 why it provides its own pseudo file system which is usually mounted at
 471 <code>/dev/cpuset</code>. This file system is only kept for backwards
 472 compatibility and is otherwise equivalent to the corresponding part of
 473 the cgroup pseudo file system.  The cpuset controller links subsets
 474 of CPUs to cgroups so that the processes in a cgroup are confined to
 475 run only on the CPUs of "their" subset.
 476
 477 The CPU controller of cgroup-v2, which is simply called "cpu", works
 478 differently. Instead of specifying the set of admissible CPUs for a
 479 cgroup, one defines the ratio of CPU cycles for the cgroup.  Work to
 480 support CPU partitioning like the cpuset controller of cgroup-v1 is in
 481 progress and expected to be ready in 2019.
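
To illustrate the ratio-based model, the sketch below computes the
expected fraction of CPU time for a set of sibling cgroups. It assumes
that the ratio is configured through the <code>cpu.weight</code> file
of each cgroup and that all siblings are busy. The function name is
hypothetical.

```python
def cpu_shares(weights):
    """Map each sibling cgroup to its fraction of CPU time under contention."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

With weights 100 and 300, the first cgroup would receive a quarter of
the cycles and the second one three quarters, as long as both have
runnable processes.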
 482
 483 SUBSECTION(«Devices»)
 484
 485 The device controller of cgroup-v1 imposes mandatory access control
 486 for device-special files. It tracks the <code> open(2) </code> and
 487 <code> mknod(2) </code> system calls and enforces the restrictions
 488 defined in the <em> device access whitelist </em> of the cgroup the
 489 calling process belongs to.
 490
 491 Processes in the root cgroup have full permissions. Other cgroups
 492 inherit the device permissions from their parent. A child cgroup
 493 never has more permission than its parent.
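
The whitelist check can be modelled roughly as follows. The sketch
assumes entries of the form <code>type major:minor access</code> as
used by the <code>devices.allow</code> file of cgroup-v1, where the
type is <code>a</code>, <code>b</code> or <code>c</code>, an asterisk
matches any number, and the access string is a subset of
<code>rwm</code>. It is a simplification, not the kernel's algorithm,
and the function names are hypothetical.

```python
def parse_entry(entry):
    """Parse 'c 1:3 rwm' into (type, major, minor, access set)."""
    dev_type, numbers, access = entry.split()
    major, minor = numbers.split(":")
    return dev_type, major, minor, set(access)

def allowed(whitelist, dev_type, major, minor, access):
    """Check whether a single whitelist entry grants the requested access."""
    for entry in whitelist:
        t, ma, mi, acc = parse_entry(entry)
        if t not in ("a", dev_type):
            continue                      # wrong device type
        if ma not in ("*", str(major)) or mi not in ("*", str(minor)):
            continue                      # wrong device number
        if set(access) <= acc:
            return True
    return False
```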
 494
 495 Cgroup-v2 takes a completely different approach to device access
 496 control. It is implemented on top of BPF, the <em> Berkeley packet
 497 filter</em>. Hence this controller is not listed in the cgroup-v2
 498 pseudo file system.
 499
 500 SUBSECTION(«Freezer»)
 501
 502 Both cgroup-v1 and cgroup-v2 implement a <em>freezer</em> controller,
 503 which provides an ability to stop ("freeze") all processes in a
 504 cgroup to free up resources for other tasks. The stopped processes can
 505 be continued ("thawed") as a unit later. This is similar to sending
 506 <code>SIGSTOP/SIGCONT</code> to all processes, but avoids some problems
 507 with corner cases. The v2 version was added in 2019-07. It is available
 508 from Linux-5.2 onwards.
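
A sketch of how a script might freeze or thaw a cgroup in either
version. It assumes the <code>freezer.state</code> file for cgroup-v1
and the <code>cgroup.freeze</code> file for cgroup-v2. The function
name and its parameters are hypothetical.

```python
import os

def set_frozen(cgroup_dir, frozen, version=2):
    """Freeze or thaw all processes of the cgroup at `cgroup_dir`."""
    if version == 1:
        name = "freezer.state"
        value = "FROZEN" if frozen else "THAWED"
    else:
        name = "cgroup.freeze"
        value = "1" if frozen else "0"
    with open(os.path.join(cgroup_dir, name), "w") as f:
        f.write(value)
    return name, value
```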
 509
 510 SUBSECTION(«Memory»)
 511
 512 Cgroup-v1 offers three controllers related to memory management. First
 513 there is the cpusetcontroller described above which can be instructed
 514 to let processes allocate only memory which is close to the CPUs
 515 of the cpuset. This makes sense on NUMA (non-uniform memory access)
 516 systems where the memory access time for a given CPU depends on the
 517 memory location. Second, the <em> hugetlb </em> controller manages
 518 distribution and usage <em> of huge pages</em>. Third, there is the
 519 <em> memory resource </em> controller which provides a number of
 520 files in the cgroup pseudo file system to limit process memory usage,
 521 swap usage and the usage of memory by the kernel on behalf of the
 522 process. The most important tunable of the memory resource controller
 523 is <code> memory.limit_in_bytes</code>.
 524
 525 The cgroup-v2 version of the memory controller is rather more complex
 526 because it attempts to limit direct and indirect memory usage of
 527 the processes in a cgroup in a bullet-proof way. It is designed to
 528 restrain even malicious processes which try to slow down or crash
 529 the system by indirectly allocating memory. For example, a process
 530 could try to create many threads or file descriptors which all cause a
 531 (small) memory allocation in the kernel. Besides several tunables and
 532 statistics, the memory controller provides the <code> memory.events
 533 </code> file whose contents change whenever a state transition
 534 for the cgroup occurs, for example when processes start to get
 535 throttled because the high memory boundary was exceeded. This file
 536 could be monitored by a <em> management agent </em> to take appropriate
 537 actions. The main mechanism to control the memory usage is the <code>
 538 memory.high </code> file.
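
A management agent could, for instance, take snapshots of
<code>memory.events</code> and compare them. The sketch below assumes
that the file consists of lines of the form "name counter" (such as
<code>high 3</code>). Both function names are hypothetical.

```python
def parse_events(text):
    """Turn the 'name counter' lines of memory.events into a dict."""
    events = {}
    for line in text.splitlines():
        if line.strip():
            name, value = line.split()
            events[name] = int(value)
    return events

def increased(old, new):
    """Return the counters that grew between two snapshots."""
    return {k: new[k] - old.get(k, 0) for k in new if new[k] > old.get(k, 0)}
```

If the <code>high</code> counter grew between two snapshots, processes
of the cgroup were throttled in the meantime.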
 539
 540 SUBSECTION(«I/O»)
 541
 542 I/O controllers regulate the distribution of IO resources among
 543 cgroups. The throttling policy of cgroup-v2 can be used to enforce I/O
 544 rate limits on arbitrary block devices, for example on a logical volume
 545 provided by the logical volume manager (LVM). Read and write bandwidth
 546 may be throttled independently. Moreover, the number of IOPS (I/O
 547 operations per second) may also be throttled.  The I/O controller of
 548 cgroup-v1 is called <em> blkio </em> while for cgroup-v2 it is simply
 549 called <em> io</em>.  The features of the v1 and v2 I/O controllers
 550 are identical but the filenames of the pseudo files and the syntax
 551 for setting I/O limits differ. The exercises ask the reader to try
 552 out both versions.
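
The difference in syntax can be sketched as follows for a read
bandwidth limit. The file names are the ones used in the exercises
below, while the function names are hypothetical. The device numbers
are taken from the device node with <code>os.stat()</code>.

```python
import os

def device_numbers(path):
    """Return (major, minor) of a block device node such as /dev/sda."""
    rdev = os.stat(path).st_rdev
    return os.major(rdev), os.minor(rdev)

def read_limit(major, minor, bytes_per_sec, version=2):
    """Return (file name, line to write) for a read bandwidth limit."""
    if version == 1:
        line = "%d:%d %d" % (major, minor, bytes_per_sec)
        return "blkio.throttle.read_bps_device", line
    line = "%d:%d rbps=%d" % (major, minor, bytes_per_sec)
    return "io.max", line
```

The returned line would then be written to the named file in the
directory of the cgroup to be throttled.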
 553
 554 There is no cgroup-v2 controller for multi-queue schedulers so far.
 555 However, there is the <em> I/O Latency </em> controller for cgroup-v2
 556 which works for arbitrary block devices and all I/O schedulers. It
 557 features <em> I/O workload protection </em> for the processes in
 558 a cgroup. This works by throttling the processes in cgroups that
 559 have a lower latency target than those in the protected cgroup. The
 560 throttling is performed by lowering the depth of the request queue
 561 of the affected devices.
 562
 563 EXERCISES()
 564
 565 <ul>
 566         <li> Run <code> mount -t cgroup none /var/cgroup </code> and <code>
 567         mount -t cgroup2 none /var/cgroup2 </code> to mount both cgroup pseudo
 568         file systems and explore the files they provide. </li>
 569
 570         <li> Learn how to put the current shell into a new cgroup.
 571         Hints: For v1, start with <code> echo 0 > cpuset.mems && echo 0 >
 572         cpuset.cpus</code>. For v2: First activate controllers for the cgroup
 573         in the parent directory. </li>
 574
        <li> Set up the cpuset controller so that your shell process has
        access to only a single CPU core. Test that the limitation is enforced by
        running <code>stress -c 2</code>. </li>
 578
 579         <li> Repeat the above for the cgroup-v2 CPU controller. Hint: <code>
 580         echo 1000000 1000000 > cpu.max</code>. </li>
 581
        <li> In a cgroup with one bash process, start a simple loop that prints
        some output: <code> while :; do date; sleep 1; done</code>. Freeze
        the cgroup by writing the string <code> FROZEN </code>
        to a suitable <code> freezer.state </code> file in the cgroup-v1 file
        system. Then unfreeze the cgroup by writing <code> THAWED </code>
        to the same file. Find out how one can tell whether a given cgroup
        is frozen. </li>
 589
 590         <li> Pick a block device to throttle. Estimate its maximal read
 591         bandwidth by running a command like <code> ddrescue /dev/sdX
 592         /dev/null</code>.  Enforce a read bandwidth rate of 1M/s for the
 593         device by writing a string of the form <code> "$MAJOR:$MINOR $((1024 *
 594         1024))" </code> to a file named <code> blkio.throttle.read_bps_device
 595         </code> in the cgroup-v1 pseudo file system. Check that the bandwidth
 596         was indeed throttled by running the above <code> ddrescue </code>
 597         command again. </li>
 598
        <li> Repeat the previous exercise, but this time use the cgroup-v2
        interface for the I/O controller. Hint: write a string of the form
        <code> "$MAJOR:$MINOR rbps=$((1024 * 1024))" </code> to a file named
        <code>io.max</code>. </li>
 603
 604 </ul>
 605
 606 HOMEWORK(«
 607 <ul>
 608
 609         <li> In one terminal running <code> bash</code>, start a second <code>
 610         bash </code> process and print its PID with <code> echo $$</code>.
 611         Guess what happens if you run <code> kill -STOP $PID; kill -CONT
 612         $PID</code> from a second terminal, where <code> $PID </code>
 613         is the PID that was printed in the first terminal. Try it out,
 614         explain the observed behaviour and discuss its impact on the freezer
 615         controller. Repeat the experiment but this time use the freezer
 616         controller to stop and restart the bash process. </li>
 617 </ul>
 618
 619 »)
 620
 621 SECTION(«Linux Containers (LXC)»)
 622
Containers provide resource management through control groups and
resource isolation through namespaces. A <em> container platform </em>
is thus a software layer implemented on top of these features. Given a
directory containing a Linux root file system, starting the container
is a simple matter: First <code> clone(2) </code> is called with the
proper <code> CLONE_NEW* </code> flags to create a new process in a suitable
set of namespaces. The child process then creates a cgroup for the
container and puts itself into it. The final step is to let the child
process hand over control to the container's <code> /sbin/init </code>
by calling <code> execve(2)</code>. When the last process in the newly
created namespaces exits, the namespaces disappear and the parent
process removes the cgroup. The details are a bit more complicated,
but the above covers the essence of what the container startup command
has to do.
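
The startup sequence can be sketched in a few lines of shell. This is
an illustration under several assumptions, not how any particular
container platform does it: <code> unshare(1) </code> from util-linux
stands in for the direct <code> clone(2) </code> call, <code> chroot(1)
</code> is used to enter the root file system, cgroup-v2 is assumed, and
the function name <code> start_container </code> is made up for this
example. The function must be run as root.

```shell
#!/bin/sh
# usage: start_container ROOTFS CGROUP_DIR
start_container() {
	# unshare(1) creates a child process in new UTS, IPC, network,
	# mount and PID namespaces. Thanks to --fork, the child is PID 1
	# of the new PID namespace. The child then
	#   1. creates the cgroup,
	#   2. enters it by writing its own PID to cgroup.procs,
	#   3. hands over control to the init of the container.
	unshare --uts --ipc --net --mount --pid --fork sh -c '
		mkdir "$1" &&
		echo $$ > "$1/cgroup.procs" &&
		exec chroot "$2" /sbin/init
	' sh "$2" "$1"
	status=$?
	# unshare returns after the container has terminated, so the
	# cgroup is empty and can be removed by the parent.
	rmdir "$2"
	return "$status"
}
```

With the cgroup-v2 file system mounted on <code> /var/cgroup2 </code>
and a root file system in <code> /media/lxc/buru</code>, the call
would be <code> start_container /media/lxc/buru /var/cgroup2/buru</code>.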
 637
Many container platforms offer additional features not to be discussed
here, like downloading and unpacking a file system image from the
internet, or supplying the root file system for the container by other
means, for example by creating an LVM snapshot of a master image.
LXC is a comparatively simple container platform which can be used to
start a single daemon in a container, or to boot a container from
a root file system as described above. It provides several <code>
lxc-* </code> commands to start, stop and maintain containers.
LXC version 1 is much simpler than subsequent versions, and is still
being maintained, so we only discuss this version of LXC here.
 648
An LXC container is defined by a configuration file in
the format described in <code> lxc.conf(5)</code>. A <a
href="#minimal_lxc_config_file"> minimal configuration </a> which
defines a network device and requests CPU and memory isolation has
as few as 10 lines (not counting comments). With the configuration
file and the root file system in place, the container can be started
by running <code> lxc-start -n $NAME</code>. One can log in to the
container on the local pseudo terminal or via ssh (provided an SSH
server is installed in the container). The container can be stopped
by executing <code> halt </code> from within the container, or by
running <code> lxc-stop </code> on the host system. <code> lxc-ls
</code> and <code> lxc-info</code> print information about containers,
and <code> lxc-cgroup </code> changes the settings of the cgroup
associated with a container.
 663
 664 The exercises ask the reader to install the LXC package from source,
 665 and to set up a minimal container running Ubuntu-18.04.
 666
 667 EXERCISES()
 668
 669 <ul>
 670
        <li> Clone the LXC git repository from <code>
        https://github.com/lxc/lxc</code> and check out the <code> stable-1.0
        </code> branch. Compile the source code with <code> ./autogen.sh </code>
        and <code> ./configure && make</code>. Install with <code> sudo make
        install</code>. </li>
 676
 677         <li> Download a minimal Ubuntu root file system with a command like
 678         <code> debootstrap --download-only --include isc-dhcp-client bionic
 679         /media/lxc/buru/ http://de.archive.ubuntu.com/ubuntu</code>. </li>
 680
 681         <li> Set up an ethernet bridge as described in the <a
 682         href="./Networking.html#link_layer">Link Layer</a> section of the
 683         chapter on networking. </li>
 684
 685         <li> Examine the <a href="#minimal_lxc_config_file"> minimal
 686         configuration file </a> for the container and copy it to <code>
 687         /var/lib/lxc/buru/config</code>. Adjust host name, MAC address and
 688         the name of the bridge interface. </li>
 689
 690         <li> Start the container with <code> lxc-start -n buru</code>. </li>
 691
 692         <li> While the container is running, investigate the control files of the
 693         cgroup pseudo file system. Identify the pseudo files which describe the
 694         CPU and memory limit. </li>
 695
 696         <li> Come up with a suitable <code> lxc-cgroup </code> command
 697         to change the cpuset and the memory of the container while it is
 698         running. </li>
 699
 700         <li> On the host system, create a loop device and a file system on
 701         it. Mount the file system on a subdirectory of the root file system
 702         of the container.  Note that the mount is not visible from within the
 703         container. Come up with a way to make it visible without restarting
 704         the container. </li>
 705
 706 </ul>
 707
 708 HOMEWORK(«Compare the features of LXC versions 1, 2 and 3.»)
 709
 710 SUPPLEMENTS()
 711
 712 SUBSECTION(«UTS Namespace Example»)
<pre>
        <code>
                #define _GNU_SOURCE
                #include &lt;sys/utsname.h&gt;
                #include &lt;sched.h&gt;
                #include &lt;stdio.h&gt;
                #include &lt;stdlib.h&gt;
                #include &lt;unistd.h&gt;

                /* Print the hostname of the caller's UTS namespace, then exit. */
                static void print_hostname_and_exit(const char *pfx)
                {
                        struct utsname uts;

                        uname(&amp;uts);
                        printf("%s: %s\n", pfx, uts.nodename);
                        exit(EXIT_SUCCESS);
                }

                /* Runs in a new UTS namespace, so the hostname of the parent is unaffected. */
                static int child(void *arg)
                {
                        sethostname("jesus", 5);
                        print_hostname_and_exit("child");
                        return 0; /* not reached */
                }

                #define STACK_SIZE (64 * 1024)
                static char child_stack[STACK_SIZE];

                int main(int argc, char *argv[])
                {
                        /* The stack grows downwards, so pass the end of the buffer. */
                        if (clone(child, child_stack + STACK_SIZE, CLONE_NEWUTS, NULL) &lt; 0) {
                                perror("clone"); /* CLONE_NEWUTS requires CAP_SYS_ADMIN */
                                exit(EXIT_FAILURE);
                        }
                        print_hostname_and_exit("parent");
                }
        </code>
</pre>
 747
 748 SUBSECTION(«PID Namespace Example»)
<pre>
        <code>
                #define _GNU_SOURCE
                #include &lt;sched.h&gt;
                #include &lt;unistd.h&gt;
                #include &lt;stdlib.h&gt;
                #include &lt;stdio.h&gt;

                static int child(void *arg)
                {
                        /* In the new PID namespace this prints PID 1 and PPID 0. */
                        printf("PID: %d, PPID: %d\n", (int)getpid(), (int)getppid());
                        return 0;
                }

                #define STACK_SIZE (64 * 1024)
                static char child_stack[STACK_SIZE];

                int main(int argc, char *argv[])
                {
                        pid_t pid = clone(child, child_stack + STACK_SIZE, CLONE_NEWPID, NULL);

                        if (pid &lt; 0) {
                                perror("clone"); /* CLONE_NEWPID requires CAP_SYS_ADMIN */
                                exit(EXIT_FAILURE);
                        }
                        /* The parent sees the PID of the child in its own namespace. */
                        printf("child PID: %d\n", (int)pid);
                        exit(EXIT_SUCCESS);
                }
        </code>
</pre>
 773
 774 SUBSECTION(«Minimal LXC Config File»)
<pre>
        <code>
                # Employ cgroups to limit the CPUs and the amount of memory the container is
                # allowed to use.
                lxc.cgroup.cpuset.cpus = 0-1
                lxc.cgroup.memory.limit_in_bytes = 2G

                # So that the container starts out with a fresh UTS namespace that
                # has already set its hostname.
                lxc.utsname = buru

                # LXC does not play ball if we don't set the type of the network device.
                # It will always be veth.
                lxc.network.type = veth

                # This sets the name of the veth pair which is visible on the host. This
                # way it is easy to tell which interface belongs to which container.
                lxc.network.veth.pair = buru

                # Of course we need to tell LXC where the root file system of the container
                # is located. LXC will automatically mount a couple of pseudo file systems
                # for the container, including /proc and /sys.
                lxc.rootfs = /media/lxc/buru

                # So that we can assign a fixed address via DHCP.
                lxc.network.hwaddr = ac:de:48:32:35:cf

                # You must NOT have a link from /dev/kmsg pointing to /dev/console. In the host
                # it should be a real device. In a container it must NOT exist. When /dev/kmsg
                # points to /dev/console, systemd-journald reads from /dev/kmsg and then writes
                # to /dev/console (which it then reads from /dev/kmsg and writes again to
                # /dev/console ad infinitum). You've inadvertently created a messaging loop
                # that's causing systemd-journald to go berserk on your CPU.
                #
                # Make sure to remove /var/lib/lxc/${container}/rootfs.dev/kmsg
                lxc.kmsg = 0

                lxc.network.link = br39

                # This is needed for lxc-console
                lxc.tty = 4
        </code>
</pre>
 818
 819 SECTION(«Further Reading»)
 820 <ul>
        <li> <a href="https://lwn.net/Articles/782876/">The creation of the
        io.latency block I/O controller</a>, by Josef Bacik. </li>
 823 </ul>