OS-Level_Virtualization.m4

   1 TITLE(«
   2 Fools ignore complexity. Pragmatists suffer it. Some can avoid it.
   3 Geniuses remove it. -- Perlis's Programming Proverb #58 (1982)
   4 », __file__)
   5
   6
   7 OVERVIEW(«
   8
   9 In general, virtualization refers to the abstraction of computer
  10 resources. This chapter is primarily concerned with <em> server
  11 virtualization</em>, a concept which makes it possible to run
  12 more than one operating system simultaneously and independently
  13 of each other on a single physical computer.  We first describe
  14 the different virtualization frameworks but quickly focus on
  15 Linux OS-level virtualization and its virtual machines, called <em>
  16 containers</em>. Container platforms for Linux are built on top of
  17 <em>namespaces</em> and <em>control groups</em>, the low-level kernel
  18 features which implement abstraction and isolation of processes. We
  19 look at both concepts in some detail. The final section discusses
  20 <em>micoforia</em>, a minimal container platform.
  21
  22 »)
  23
  24 SECTION(«Virtualization Frameworks»)
  25
  26 The origins of server virtualization date back to the 1960s. The
  27 first virtual machine was created as a collaboration between IBM
  28 (International Business Machines) and MIT (Massachusetts Institute
  29 of Technology). Since then, many different approaches have been
  30 designed, resulting in several <em> Virtualization Frameworks</em>. All
  31 frameworks promise to improve resource utilization and availability, to
  32 reduce costs, and to provide greater flexibility. While some of these
  33 benefits might be real, they do not come for free. The costs include
  34 a host that becomes a single point of failure, decreased performance,
  35 added complexity, and increased maintenance costs due to extensive
  36 debugging, documentation, and maintenance of the VMs. This section
  37 briefly describes the three main virtualization frameworks. We list
  38 the advantages and disadvantages of each and give some examples.
  39
  40 SUBSECTION(«Software Virtualization (Emulation)»)
  41
  42 This virtualization framework does not play a significant role in
  43 server virtualization; it is included only for completeness. Emulation
  44 means to imitate a complete hardware architecture in software,
  45 including peripheral devices. All CPU instructions and hardware
  46 interrupts are interpreted by the emulator rather than being run by
  47 native hardware. Since this approach has a large performance penalty,
  48 it is only suitable when speed is not critical. For this reason,
  49 emulation is typically employed for ancient hardware like arcade
  50 game systems and home computers such as the Commodore 64. Despite
  51 the performance penalty, emulation is valuable because it allows
  52 applications and operating systems to run on the current platform as
  53 they did in their original environment.
  54
  55 Examples: Bochs, MAME, VICE.
  56
  57 SUBSECTION(«Paravirtualization and Hardware-Assisted Virtualization»)
  58
  59 These virtualization frameworks are characterized by the presence
  60 of a <em> hypervisor</em>, also known as <em> Virtual Machine
  61 Monitor</em>, which translates system calls from the VMs to native
  62 hardware requests. In contrast to Software Virtualization, the
  63 host OS does not emulate hardware resources but offers a special
  64 API to the VMs. If the presented interface is different from that
  65 of the underlying hardware, the term <em> paravirtualization </em>
  66 is used. The guest OS then has to be adapted to include modified
  67 (paravirtualized) drivers. Starting in 2005, Intel and AMD added hardware
  68 virtualization instructions to the CPUs and IOMMUs (Input/Output memory
  69 management units) to the chipsets. This allowed VMs to directly execute
  70 privileged instructions and use peripheral devices. This so-called <em>
  71 Hardware-Assisted Virtualization </em> allows unmodified operating
  72 systems to run on the VMs.
  73
  74 The main advantage of Hardware-Assisted Virtualization is its
  75 flexibility, as the host OS does not need to match the OS running on
  76 the VMs. The disadvantages are hardware compatibility constraints and
  77 performance loss. Although these days all hardware has virtualization
  78 support, there are still significant differences in performance between
  79 the host and the VM. Moreover, peripheral devices like storage hardware
  80 have to be compatible with the chipset to make use of the IOMMU.
  81
  82 Examples: KVM (with QEMU as hypervisor), Xen, UML
  83
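Whether a CPU offers Hardware-Assisted Virtualization can be read from
the <code> flags </code> line of <code> /proc/cpuinfo</code>: Intel CPUs
report <code> vmx</code>, AMD CPUs report <code> svm</code>. The
following shell sketch illustrates the check. The function name
<code> virt_support </code> is made up for this example, and the sketch
assumes an x86 Linux system with the usual text utilities installed.

```shell
# virt_support: read cpuinfo-formatted text on stdin and print the
# hardware virtualization flag found (vmx or svm), or "none".
# The function name is hypothetical, chosen for this sketch.
virt_support() {
    flag=$(grep -E '^flags' | grep -o -w -E 'vmx|svm' | sort -u | head -n 1)
    echo "${flag:-none}"
}

# Apply the check to the running system if the file is available.
if [ -r /proc/cpuinfo ]; then
    virt_support < /proc/cpuinfo
fi
```

On systems that do not report a <code> flags </code> line (for example
non-x86 machines) the sketch prints <code> none</code>, which only means
that neither of the two x86 flags was found.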
  84 SUBSECTION(«OS-level Virtualization (Containers)»)
  85
  86 OS-level Virtualization is a technique for lightweight virtualization.
  87 The abstractions are built directly into the kernel and no
  88 hypervisor is needed. In this context the term "virtual machine" is
  89 inaccurate, which is why OS-level VMs go by different names.
  90 On Linux, they are called <em> containers</em>; other
  91 operating systems call them <em> jails </em> or <em> zones</em>. We
  92 shall exclusively use "container" from now on. All containers share
  93 a single kernel, so the OS running in the container has to match the
  94 host OS. However, each container has its own root file system, so
  95 containers can differ in user space. For example, different containers
  96 can run different Linux distributions. Since programs running in a
  97 container use the normal system call interface to communicate with
  98 the kernel, OS-level Virtualization does not require hardware support
  99 for efficient performance. In fact, OS-level Virtualization imposes
 100 virtually no overhead.
 101
 102 OS-level Virtualization is superior to the alternatives because of its
 103 simplicity and its performance. The only disadvantage is the lack of
 104 flexibility. It is simply not an option if some of the VMs must run
 105 different operating systems than the host.
 106
 107 Examples: LXC, Micoforia, Singularity, Docker.
 108
 109 EXERCISES()
 110
 111 <ul>
 112
 113         <li> On any Linux system, check if the processor supports virtualization
 114         by running <code> cat /proc/cpuinfo</code>. Hint: svm and vmx. </li>
 115
 116         <li> Hypervisors come in two flavors called <em> native </em> and <em>
 117         hosted</em>. Explain the difference and the pros and cons of either
 118         flavor. Is QEMU a native or a hosted hypervisor? </li>
 119
 120         <li> Find the AMD Programmer's Manual online. The chapter on
 121         "Secure Virtual Machine" describes the CPU instructions related to
 122         Hardware-Assisted Virtualization. Glance over this chapter to get an
 123         idea of the complexity of Hardware-Assisted Virtualization. </li>
 124
 125 </ul>
 126
 127 HOMEWORK(«
 128
 129 <ul>
 130         <li> Recall the concept of <em> direct memory access </em> (DMA)
 131         and explain why DMA is a problem for virtualization. Which of the
 132         three virtualization frameworks of this chapter are affected by this
 133         problem? </li>
 134
 135         <li> Compare AMD's <em> Rapid Virtualization Indexing </em> to Intel's
 136         <em> Extended Page Tables</em>. </li>
 137
 138         <li> Suppose a hacker gained root access to a VM and wishes to proceed
  139         from there to also gain full control over the host OS. Discuss the threat
 140         model in the context of the three virtualization frameworks covered
 141         in this section. </li>
 142
 143 </ul>
 144 »)
 145
 146 SECTION(«Namespaces»)
 147
 148 Namespaces partition the set of processes into disjoint subsets
 149 with local scope. Where the traditional Unix systems provided only
 150 a single system-wide resource shared by all processes, the namespace
 151 abstractions make it possible to give processes the illusion of living
 152 in their own isolated instance.  Linux implements the following
 153 six different types of namespaces: mount (Linux-2.4.x, 2002), IPC
 154 (Linux-2.6.19, 2006), UTS (Linux-2.6.19, 2006), PID (Linux-2.6.24,
 155 2008), network (Linux-2.6.29, 2009), user (Linux-3.8, 2013).
 156 For OS-level virtualization all six namespace types are typically
 157 employed to make the containers look like independent systems.
 158
 159 Before we look at each namespace type, we briefly describe how
 160 namespaces are created and how information related to namespaces can
 161 be obtained for a process.
 162
 163 SUBSECTION(«Namespace API»)
 164
 165 <p> Initially, there is only a single namespace of each type called the
 166 <em> root namespace</em>. All processes belong to this namespace. The
 167 <code> clone(2) </code> system call is a generalization of the classic
 168 <code> fork(2) </code> which allows privileged users to create new
 169 namespaces by passing one or more of the six <code> CLONE_NEW* </code>
 170 flags. The child process is made a member of the new namespace. Calling
 171 plain <code> fork(2) </code> or <code> clone(2) </code> with no
 172 <code> CLONE_NEW* </code> flag lets the newly created process inherit the
 173 namespaces from its parent. There are two additional system calls,
 174 <code> setns(2) </code> and <code> unshare(2) </code> which both
 175 change the namespace(s) of the calling process without creating a
 176 new process. For the latter, there is a user command, also called
 177 <code> unshare(1) </code> which makes the namespace API available to
 178 scripts. </p>
 179
 180 <p> The <code> /proc/$PID </code> directory of each process contains a
 181 <code> ns </code> subdirectory which contains one file per namespace
 182 type. The inode number of this file is the <em> namespace ID</em>.
 183 Hence, by running <code> stat(1) </code> one can tell whether
 184 two different processes belong to the same namespace. Normally a
 185 namespace ceases to exist when the last process in the namespace
 186 terminates. However, by opening <code> /proc/$PID/ns/$TYPE </code>
 187 one can prevent the namespace from disappearing. </p>
 188
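The comparison of namespace IDs described above can be sketched in a
few lines of shell. The helper names <code> ns_id </code> and <code>
same_ns </code> are hypothetical, and the sketch assumes a <code>
stat(1) </code> which understands <code> -L </code> and <code> -c %i
</code> (GNU coreutils and BusyBox do).

```shell
# ns_id PID TYPE: print the namespace ID, i.e. the inode number behind
# /proc/PID/ns/TYPE. Both helper names are made up for this sketch.
ns_id() {
    stat -L -c %i "/proc/$1/ns/$2"
}

# same_ns PID1 PID2 TYPE: succeed if both processes are members of the
# same namespace of the given type. Fails if either ID cannot be read.
same_ns() {
    a=$(ns_id "$1" "$3") && b=$(ns_id "$2" "$3") && [ "$a" = "$b" ]
}

# Compare the UTS namespace of this shell with that of PID 1.
if same_ns $$ 1 uts 2>/dev/null; then
    echo "same UTS namespace as PID 1"
else
    echo "different UTS namespace, or PID 1 may not be inspected"
fi
```

Reading the namespace files of a foreign process requires sufficient
privileges, which is why the second branch cannot distinguish between
"different" and "not permitted".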
 189 SUBSECTION(«UTS Namespaces»)
 190
 191 UTS is short for <em> UNIX Time-sharing System</em>. The old-fashioned
 192 word "Time-sharing" has been replaced by <em> multitasking</em>
 193 but the old name lives on in the <code> uname(2) </code> system
 194 call which fills out the fields of a <code> struct utsname</code>.
 195 On return the <code> nodename </code> field of this structure
 196 contains the hostname which was set by a previous call to <code>
 197 sethostname(2)</code>. Similarly, the <code> domainname </code> field
 198 contains the string that was set with <code> setdomainname(2)</code>.
 199
 200 UTS namespaces provide isolation of these two system identifiers. That
 201 is, processes in different UTS namespaces might see different host and
 202 domain names. Changing the host or domain name affects only processes
 203 which belong to the same UTS namespace as the process which called
 204 <code> sethostname(2) </code> or <code> setdomainname(2)</code>.
 205
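The isolation of the hostname can be observed with <code> unshare(1)
</code> alone. The sketch below is an illustration under assumptions:
it relies on the <code> --user</code>, <code> --map-root-user </code>
and <code> --uts </code> options of the util-linux version of <code>
unshare(1)</code>, on unprivileged user namespaces being enabled, and
on a <code> hostname(1) </code> command being installed. The name
<code> uts_demo </code> and the hostname <code> demo-container </code>
are made up.

```shell
# uts_demo: set a hostname inside a fresh UTS namespace and print it.
# Falls back to a message where creating namespaces is not permitted.
uts_demo() {
    unshare --user --map-root-user --uts \
        sh -c 'hostname demo-container && uname -n' 2>/dev/null \
        || echo "unshare not permitted here"
}

before=$(uname -n)
uts_demo
# The hostname seen outside of the new namespace is unaffected.
echo "host still sees: $(uname -n)"
```

The user namespace is only created to obtain the privileges needed for
<code> sethostname(2)</code>. When run as root, <code> unshare --uts
</code> alone suffices.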
 206 SUBSECTION(«Mount Namespaces»)
 207
 208 The <em> mount namespaces </em> are the oldest Linux namespace
 209 type. This is kind of natural since they are supposed to overcome
 210 well-known limitations of the venerable <code> chroot(2) </code>
 211 system call which was introduced in 1979. Mount namespaces isolate
 212 the mount points seen by processes so that processes in different
 213 mount namespaces can have different views of the file system hierarchy.
 214
 215 As with other namespace types, new mount namespaces are created by
 216 calling <code> clone(2) </code> or <code> unshare(2)</code>. The
 217 new mount namespace starts out with a copy of the caller's mount
 218 point list.  However, with more than one mount namespace the <code>
 219 mount(2) </code> and <code> umount(2) </code> system calls no longer
 220 operate on a global set of mount points. Whether or not a mount
 221 or unmount operation has an effect on processes in different mount
 222 namespaces than the caller's is determined by the configurable <em>
 223 mount propagation </em> rules. By default, modifications to the list
 224 of mount points affect only the processes which are in the same
 225 mount namespace as the process which initiated the modification. This
 226 setting is controlled by the <em> propagation type </em> of the
 227 mount point. Besides the obvious private and shared types, there is
 228 also the <code> MS_SLAVE </code> propagation type which lets mount
 229 and unmount events propagate from a "master" to its "slaves"
 230 but not the other way round.
 231
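The propagation type of each mount point is visible in the optional
fields of <code> /proc/$PID/mountinfo</code>: according to <code>
proc(5)</code>, a <code> shared:N </code> tag marks a shared mount,
<code> master:N </code> marks a slave, and the absence of both marks a
private mount. The following sketch classifies mount points on this
basis. The function name <code> propagation_types </code> is
hypothetical.

```shell
# propagation_types: read mountinfo-formatted lines on stdin and print
# each mount point followed by its propagation type. The optional
# fields start at field 7 and end at the "-" separator.
propagation_types() {
    awk '{
        shared = 0; slave = 0; unbind = 0
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/) shared = 1
            if ($i ~ /^master:/) slave = 1
            if ($i == "unbindable") unbind = 1
        }
        type = "private"
        if (unbind) type = "unbindable"
        else if (shared && slave) type = "shared+slave"
        else if (shared) type = "shared"
        else if (slave) type = "slave"
        print $5, type
    }'
}

# Show the propagation types of the mounts seen by this shell.
if [ -r /proc/self/mountinfo ]; then
    propagation_types < /proc/self/mountinfo
fi
```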
 232 SUBSECTION(«Network Namespaces»)
 233
 234 Network namespaces not only partition the set of processes, as all
 235 six namespace types do, but also the set of network interfaces. That
 236 is, each physical or virtual network interface belongs to one (and
 237 only one) network namespace. Initially, all interfaces are in the
 238 root network namespace. This can be changed with the command <code>
 239 ip link set iface netns PID</code>. Processes only see interfaces
 240 whose network namespace matches the one they belong to. This lets
 241 processes in different network namespaces have different ideas about
 242 which network devices exist. Each network namespace has its own IP
 243 stack, IP routing table and TCP and UDP ports. This makes it possible
 244 to start, for example, many <code> sshd(8) </code> processes which
 245 all listen on "their own" TCP port 22.
 246
 247 An OS-level virtualization framework typically leaves physical
 248 interfaces in the root network namespace but creates a dedicated
 249 network namespace and a virtual interface pair for each container. One
 250 end of the pair is left in the root namespace while the other end is
 251 configured to belong to the dedicated namespace, which contains all
 252 processes of the container.
 253
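The per-container network setup just described boils down to a handful
of <code> ip(8) </code> commands. Since running them requires root
privileges, the sketch below only prints the commands instead of
executing them. The function name <code> veth_plan </code> and the
interface naming scheme are made up for this illustration. Note that
Linux interface names are limited to 15 characters.

```shell
# veth_plan NAME PID: print the commands which create a veth pair for
# the container NAME and move one end into the network namespace of
# the process PID. Nothing is executed.
veth_plan() {
    host_if="ve-$1-h"   # end that stays in the root namespace
    ct_if="ve-$1-c"     # end that is handed to the container
    echo "ip link add $host_if type veth peer name $ct_if"
    echo "ip link set $ct_if netns $2"
    echo "ip link set $host_if up"
}

# Hypothetical container "web" whose init process has PID 4242.
veth_plan web 4242
```

A real container platform would additionally assign addresses and
attach the host end to a bridge or set up routing, which is omitted
here.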
 254 SUBSECTION(«PID Namespaces»)
 255
 256 This namespace type allows a process to have more than one process
 257 ID. Unlike network interfaces which disappear when they enter a
 258 different network namespace, a process is still visible in the root
 259 namespace after it has entered a different PID namespace. Besides its
 260 existing PID it gets a second PID which is only valid inside the target
 261 namespace. Similarly, when a new PID namespace is created by passing
 262 the <code> CLONE_NEWPID </code> flag to <code> clone(2)</code>, the
 263 child process gets some unused PID in the original PID namespace
 264 but PID 1 in the new namespace.
 265
 266 As a consequence, processes in different PID namespaces can have the
 267 same PID. In particular, there can be arbitrarily many "init" processes,
 268 which all have PID 1. The usual rules for PID 1 apply within each PID
 269 namespace. That is, orphaned processes are reparented to the init
 270 process, and it is a fatal error if the init process terminates,
 271 causing all processes in the namespace to terminate as well. PID
 272 namespaces can be nested, but under normal circumstances they are
 273 not. So we won't discuss nesting.
 274
 275 Since each process in a non-root PID namespace has also a PID in the
 276 root PID namespace, processes in the root PID namespace can "see" all
 277 processes but not vice versa. Hence a process in the root namespace can
 278 send signals to all processes while processes in the child namespace
 279 can only send signals to processes in their own namespace.
 280
 281 Processes can be moved from the root PID namespace into a child
 282 PID namespace but not the other way round. Moreover, a process can
 283 instruct the kernel to create subsequent child processes in a different
 284 PID namespace.
 285
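The different PIDs of a process can be inspected from the outside:
since Linux-4.1 the <code> NSpid </code> line of <code>
/proc/$PID/status </code> lists the PID of the process in each PID
namespace it is a member of, outermost namespace first. This detail is
not covered in the text above, and the function name <code> ns_pids
</code> is made up for this sketch.

```shell
# ns_pids: read the contents of a /proc/PID/status file on stdin and
# print the PIDs listed on the NSpid line, separated by spaces.
ns_pids() {
    awk '$1 == "NSpid:" {
        for (i = 2; i <= NF; i++)
            printf "%s%s", $i, (i < NF ? " " : "\n")
    }'
}

# A process which lives only in the root PID namespace has one PID.
if [ -r /proc/self/status ]; then
    ns_pids < /proc/self/status
fi
```

For the init process of a container, inspected from the root PID
namespace, one would expect two numbers, the second of which is 1.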
 286 SUBSECTION(«User Namespaces»)
 287
 288 User namespaces have been implemented rather late compared to other
 289 namespace types. The implementation was completed in 2013. The purpose
 290 of user namespaces is to isolate user and group IDs. Initially there
 291 is only one user namespace, the <em> initial namespace </em> to which
 292 all processes belong. As with all namespace types, a new user namespace
 293 is created with <code> unshare(2) </code> or <code> clone(2)</code>.
 294
 295 The UID and GID of a process can be different in different
 296 namespaces. In particular, an unprivileged process may have UID
 297 0 inside a user namespace. When a process is created in a new
 298 namespace or a process joins an existing user namespace, it gains full
 299 privileges in this namespace. However, the process has no additional
 300 privileges in the parent/previous namespace. Moreover, a certain flag
 301 is set for the process which prevents the process from entering yet
 302 another namespace with elevated privileges. In particular it does not
 303 keep its privileges when it returns to its original namespace. User
 304 namespaces can be nested, but we don't discuss nesting here.
 305
 306 Each user namespace has an <em> owner</em>, which is the effective user
 307 ID (EUID) of the process which created the namespace. Any process
 308 in the root user namespace whose EUID matches the owner ID has all
 309 capabilities in the child namespace.
 310
 311 If <code> CLONE_NEWUSER </code> is specified together with other
 312 <code> CLONE_NEW* </code> flags in a single <code> clone(2) </code>
 313 or <code> unshare(2) </code> call, the user namespace is guaranteed
 314 to be created first, giving the child/caller privileges over the
 315 remaining namespaces created by the call.
 316
 317 It is possible to map UIDs and GIDs between namespaces.  The <code>
 318 /proc/$PID/uid_map </code> and <code> /proc/$PID/gid_map </code> files
 319 are used to get and set the mappings. We will only talk about UID
 320 mappings in the sequel because the mechanism for the GID mappings is
 321 analogous. When the <code> /proc/$PID/uid_map </code> (pseudo-)file is
 322 read, the contents are computed on the fly and depend on both the user
 323 namespace to which process <code> $PID </code> belongs and the user
 324 namespace of the calling process. Each line contains three numbers
 325 which specify the mapping for a range of UIDs. The numbers have
 326 to be interpreted in one of two ways, depending on whether the two
 327 processes belong to the same user namespace or not. All system calls
 328 which deal with UIDs transparently translate UIDs by consulting these
 329 maps. A map for a newly created namespace is established by writing
 330 UID-triples <em> once </em> to <em> one </em> <code> uid_map </code>
 331 file. Subsequent writes will fail.
 332
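To make the UID-triples concrete: following <code>
user_namespaces(7)</code>, the three numbers of each line are the first
UID of the range inside the namespace, the first UID of the range it
maps to outside, and the length of the range. The translation is then
simple arithmetic, as the sketch below shows. The function name <code>
map_uid </code> and the sample map are made up for this example.

```shell
# map_uid UID: translate a UID as seen inside a user namespace to the
# corresponding outside UID, using uid_map-formatted lines on stdin.
# Each line holds: first inside ID, first outside ID, range length.
map_uid() {
    awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { if (!found) print "unmapped" }'
}

# Hypothetical map: inside UID 0 is outside UID 1000, and inside UIDs
# 1..65536 are outside UIDs 100000..165535.
sample_map='0 1000 1
1 100000 65536'

echo "$sample_map" | map_uid 0       # prints 1000
echo "$sample_map" | map_uid 5       # prints 100004
echo "$sample_map" | map_uid 70000   # prints unmapped
```

UIDs which are not covered by any range have no counterpart outside of
the namespace.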
 333 SUBSECTION(«IPC Namespaces»)
 334
 335 System V inter-process communication (IPC) subsumes three different
 336 mechanisms which enable unrelated processes to communicate with each
 337 other. These mechanisms, known as <em> message queues</em>, <em>
 338 semaphores </em> and <em> shared memory</em>, predate Linux by at
 339 least a decade. They are mandated by the POSIX standard, so every Unix
 340 system has to implement the prescribed API. The common characteristic
 341 of the System V IPC mechanisms is that their objects are addressed
 342 by system-wide IPC <em> identifiers</em> rather than by pathnames.
 343
 344 IPC namespaces isolate these resources so that processes in different
 345 IPC namespaces have different views of the existing IPC identifiers.
 346 When a new IPC namespace is created, it starts out with all three
 347 identifier sets empty. Newly created IPC objects are only visible
 348 to processes which belong to the same IPC namespace as the process
 349 which created the object.
 350
 351 EXERCISES()
 352
 353 <ul>
 354
 355         <li> Examine <code> /proc/$$/mounts</code>,
 356         <code>/proc/$$/mountinfo</code>, and <code>/proc/$$/mountstats</code>.
 357         </li>
 358
 359         <li> Recall the concept of a <em> bind mount</em>. Describe the
 360         sequence of mount operations a container implementation would need
 361         to perform in order to set up a container whose root file system
 362         is mounted on, say, <code> /mnt </code> before the container is
 363         started. </li>
 364
 365         <li> What should happen on the attempt to change a read-only mount
 366         to be read-write from inside of a container? </li>
 367         <li> Compile and run <code> <a
 368         href="#uts_namespace_example">utc-ns.c</a></code>, a  minimal C
 369         program which illustrates how to create a new UTS namespace. Explain
 370         each line of the source code. </li>
 371
 372         <li> Run <code> ls -l /proc/$$/ns </code> to see the namespaces of
 373         the shell.  Run <code> stat -L /proc/$$/ns/uts </code> and confirm
 374         that the inode number coincides with the number shown in the target
  375         of the link of the <code> ls </code> output. </li>
 376
 377         <li> Discuss why creating a namespace is a privileged operation. </li>
 378
 379         <li> What is the parent process ID of the init process? Examine the
 380         fourth field of <code> /proc/1/stat </code> to confirm. </li>
 381
 382         <li> It is possible for a process in a PID namespace to have a parent
 383         which is outside of this namespace. This is certainly the case for
 384         the process with PID 1. Can this also happen for a different process?
 385         </li>
 386
 387         <li> Examine the <code> <a
 388         href="#pid_namespace_example">pid-ns.c</a></code> program. Will the
 389         two numbers printed as <code> PID </code> and <code> child PID </code>
 390         be the same? What will be the PPID number? Compile and run the program
  391         to see if your guess was correct. </li>
 392
  393         <li> Create a veth device pair. Check that both ends of the pair are
 394         visible with <code> ip link show</code>. Start a second shell in a
 395         different network namespace and confirm by running the same command
 396         that no network interfaces exist in this namespace. In the original
 397         namespace, set the namespace of one end of the pair to the process ID
 398         of the second shell and confirm that the interface "moved" from one
 399         namespace to the other. Configure (different) IP addresses on both ends
 400         of the pair and transfer data through the ethernet tunnel between the
 401         two shell processes which reside in different network namespaces. </li>
 402
 403         <li> Loopback, bridge, ppp and wireless are <em> network namespace
 404         local devices</em>, meaning that the namespace of such devices can
 405         not be changed. Explain why. Run <code> ethtool -k iface </code>
 406         to find out which devices are network namespace local. </li>
 407
 408         <li> In a user namespace where the <code> uid_map </code> file has
 409         not been written, system calls like <code> setuid(2) </code> which
 410         change process UIDs fail. Why? </li>
 411
 412         <li> What should happen if a set-user-ID program is executed inside
 413         of a user namespace and the on-disk UID of the program is not a mapped
 414         UID? </li>
 415
 416         <li> Is it possible for a UID to map to different user names even if
 417         no user namespaces are in use? </li>
 418
 419 </ul>
 420
 421 HOMEWORK(«
 422 The <code> shmctl(2) </code> system call performs operations on a System V
 423 shared memory segment.  It operates on a <code> shmid_ds </code> structure
 424 which contains in the <code> shm_lpid </code> field the PID of the process
 425 which last attached or detached the segment. Describe the implications this API
 426 detail has on the interaction between IPC and PID namespaces.
 427 »)
 428
 429 SECTION(«Control Groups»)
 430
 431 <em> Control groups </em> (cgroups) allow processes to be grouped
 432 and organized hierarchically in a tree. Each control group contains
 433 processes which can be monitored or controlled as a unit, for example
 434 by limiting the resources they can occupy. Several <em> controllers
 435 </em> exist (CPU, memory, I/O, etc.), some of which actually impose
 436 control while others only provide identification and relay control
 437 to separate mechanisms. Unfortunately, control groups are not easy to
 438 understand because the controllers are implemented in an inconsistent
 439 way and because of the rather chaotic relationship between them.
 440
 441 In 2014 it was decided to rework the cgroup subsystem of the Linux
 442 kernel. To keep existing applications working, the original cgroup
 443 implementation, now called <em> cgroup-v1</em>, was retained and a
 444 second, incompatible, cgroup implementation was designed. Cgroup-v2
 445 aims to address the shortcomings of the first version, including its
 446 inefficiency, inconsistency and the lack of interoperability among
 447 controllers. The cgroup-v2 API was made official in 2016. Version 1
 448 continues to work even if both implementations are active.
 449
 450 Both cgroup implementations provide a pseudo file system that
 451 must be mounted in order to define and configure cgroups. The two
 452 pseudo file systems may be mounted at the same time (on different
 453 mountpoints). For both cgroup versions, the standard <code> mkdir(2)
 454 </code> system call creates a new cgroup. To add a process to a cgroup
 455 one must write its PID to one of the files in the pseudo file system.
 456
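The reverse question, namely to which cgroups a given process belongs,
is answered by <code> /proc/$PID/cgroup</code>. As documented in <code>
cgroups(7)</code>, each line has the form <code>
hierarchy-ID:controller-list:path</code>, and the cgroup-v2 entry is
the one with hierarchy ID 0 and an empty controller list. This file is
not discussed in the text above. The function name <code> cgroup_paths
</code> is made up for this sketch.

```shell
# cgroup_paths: read /proc/PID/cgroup-formatted lines on stdin and
# print, for each line, the cgroup version, the controllers (v1 only)
# and the path of the cgroup relative to the mount point.
cgroup_paths() {
    awk '{
        split($0, f, ":")
        path = $0
        sub(/^[^:]*:[^:]*:/, "", path)   # strip the first two fields
        if (f[2] == "") print "v2", path
        else print "v1", f[2], path
    }'
}

# Show the cgroup membership of this shell.
if [ -r /proc/self/cgroup ]; then
    cgroup_paths < /proc/self/cgroup
fi
```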
 457 We will cover both cgroup versions because as of 2018-11 many
 458 applications still rely on cgroup-v1 and cgroup-v2 still lacks some
 459 of the functionality of cgroup-v1. However, we will not look at
 460 all controllers.
 461
 462 SUBSECTION(«CPU controllers»)
 463
 464 These controllers regulate the distribution of CPU cycles. The <em>
 465 cpuset </em> controller of cgroup-v1 is the oldest cgroup controller.
 466 It was implemented before the cgroup-v1 subsystem existed, which is
 467 why it provides its own pseudo file system which is usually mounted at
 468 <code>/dev/cpuset</code>. This file system is only kept for backwards
 469 compatibility and is otherwise equivalent to the corresponding part of
 470 the cgroup pseudo file system.  The cpuset controller links subsets
 471 of CPUs to cgroups so that the processes in a cgroup are confined to
 472 run only on the CPUs of "their" subset.
 473
 474 The CPU controller of cgroup-v2, which is simply called "cpu", works
 475 differently. Instead of specifying the set of admissible CPUs for a
 476 cgroup, one defines the ratio of CPU cycles for the cgroup.  Work to
 477 support CPU partitioning like the cpuset controller of cgroup-v1 is in
 478 progress and expected to be ready in 2019.
 479
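One way to express such a ratio in cgroup-v2 is the <code> cpu.max
</code> file, which holds two numbers: the CPU time (in microseconds)
the cgroup may consume, and the length of the period to which this
budget applies. The word <code> max </code> in place of the first
number means "no limit". The sketch below turns the contents of this
file into a percentage of one CPU. The function name <code>
cpu_limit_percent </code> is made up.

```shell
# cpu_limit_percent CONTENTS: given the contents of a cpu.max file,
# e.g. "50000 100000", print the limit as a percentage of one CPU,
# or "unlimited" if no limit is configured.
cpu_limit_percent() {
    set -- $1            # split "QUOTA PERIOD" into $1 and $2
    if [ "$1" = "max" ]; then
        echo unlimited
    else
        echo $(( $1 * 100 / $2 ))
    fi
}

cpu_limit_percent "50000 100000"      # half of one CPU
cpu_limit_percent "max 100000"        # no limit configured
```

Values above 100 are possible on multi-core systems because the budget
counts CPU time summed over all CPUs.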
 480 SUBSECTION(«Devices»)
 481
 482 The device controller of cgroup-v1 imposes mandatory access control
 483 for device-special files. It tracks the <code> open(2) </code> and
 484 <code> mknod(2) </code> system calls and enforces the restrictions
 485 defined in the <em> device access whitelist </em> of the cgroup the
 486 calling process belongs to.
 487
 488 Processes in the root cgroup have full permissions. Other cgroups
 489 inherit the device permissions from their parent. A child cgroup
 490 never has more permission than its parent.
 491
 492 Cgroup-v2 takes a completely different approach to device access
 493 control. It is implemented on top of BPF, the <em> Berkeley packet
 494 filter</em>. Hence this controller is not listed in the cgroup-v2
 495 pseudo file system.
 496
 497 SUBSECTION(«Freezer»)
 498
 499 Both cgroup-v1 and cgroup-v2 implement a <em>freezer</em> controller,
 500 which provides an ability to stop ("freeze") all processes in a
 501 cgroup to free up resources for other tasks. The stopped processes can
 502 be continued ("thawed") as a unit later. This is similar to sending
 503 <code>SIGSTOP/SIGCONT</code> to all processes, but avoids some problems
 504 with corner cases. The v2 version was added in 2019-07. It is available
 505 from Linux-5.2 onwards.
 506
 507 SUBSECTION(«Memory»)
 508
 509 Cgroup-v1 offers three controllers related to memory management. First
 510 there is the cpuset controller described above which can be instructed
 511 to let processes allocate only memory which is close to the CPUs
 512 of the cpuset. This makes sense on NUMA (non-uniform memory access)
 513 systems where the memory access time for a given CPU depends on the
 514 memory location. Second, the <em> hugetlb </em> controller manages
 515 distribution and usage of <em> huge pages</em>. Third, there is the
 516 <em> memory resource </em> controller which provides a number of
 517 files in the cgroup pseudo file system to limit process memory usage,
 518 swap usage and the usage of memory by the kernel on behalf of the
 519 process. The most important tunable of the memory resource controller
 520 is <code> memory.limit_in_bytes</code>.
 521
 522 The cgroup-v2 version of the memory controller is rather more complex
 523 because it attempts to limit direct and indirect memory usage of
 524 the processes in a cgroup in a bullet-proof way. It is designed to
 525 restrain even malicious processes which try to slow down or crash
 526 the system by indirectly allocating memory. For example, a process
 527 could try to create many threads or file descriptors which all cause a
 528 (small) memory allocation in the kernel. Besides several tunables and
 529 statistics, the memory controller provides the <code> memory.events
 530 </code> file whose contents change whenever a state transition
 531 for the cgroup occurs, for example when processes start to get
 532 throttled because the high memory boundary was exceeded. This file
 533 could be monitored by a <em> management agent </em> to take appropriate
 534 actions. The main mechanism to control the memory usage is the <code>
 535 memory.high </code> file.
 536
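A management agent of the kind mentioned above would typically read
single counters from <code> memory.events</code>, which consists of
lines of the form <code> name count</code>. The following sketch
extracts one counter. The function name <code> event_count </code> and
the sample contents are made up, and a real agent would read the file
of an actual cgroup and react to changes rather than print a number.

```shell
# event_count NAME: read memory.events-formatted lines on stdin and
# print the counter of the event NAME, or 0 if there is no such line.
event_count() {
    awk -v name="$1" '
        $1 == name { print $2; found = 1 }
        END { if (!found) print 0 }'
}

# Hypothetical snapshot of a memory.events file.
sample_events='low 0
high 12
max 3
oom 0
oom_kill 0'

# Number of times the high memory boundary caused throttling.
echo "$sample_events" | event_count high
```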
 537 SUBSECTION(«I/O»)
 538
 539 I/O controllers regulate the distribution of I/O resources among
 540 cgroups. The throttling policy of cgroup-v2 can be used to enforce I/O
 541 rate limits on arbitrary block devices, for example on a logical volume
 542 provided by the logical volume manager (LVM). Read and write bandwidth
 543 may be throttled independently. Moreover, the number of IOPS (I/O
 544 operations per second) may also be throttled.  The I/O controller of
 545 cgroup-v1 is called <em> blkio </em> while for cgroup-v2 it is simply
 546 called <em> io</em>.  The features of the v1 and v2 I/O controllers
 547 are identical but the filenames of the pseudo files and the syntax
 548 for setting I/O limits differ. The exercises ask the reader to try
 549 out both versions.
 550
 551 There is no cgroup-v2 controller for multi-queue schedulers so far.
 552 However, there is the <em> I/O Latency </em> controller for cgroup-v2
 553 which works for arbitrary block devices and all I/O schedulers. It
 554 features <em> I/O workload protection </em> for the processes in
 555 a cgroup. This works by throttling the processes in cgroups that
 556 have a lower latency target than those in the protected cgroup. The
 557 throttling is performed by lowering the depth of the request queue
 558 of the affected devices.
 559
 560 EXERCISES()
 561
 562 <ul>
 563         <li> Run <code> mount -t cgroup none /var/cgroup </code> and <code>
 564         mount -t cgroup2 none /var/cgroup2 </code> to mount both cgroup pseudo
 565         file systems and explore the files they provide. </li>
 566
 567         <li> Learn how to put the current shell into a new cgroup.
 568         Hints: For v1, start with <code> echo 0 > cpuset.mems && echo 0 >
 569         cpuset.cpus</code>. For v2: First activate controllers for the cgroup
 570         in the parent directory. </li>
 571
	<li> Set up the cpuset controller so that your shell process has
	access to only a single CPU core. Test that the limitation is enforced
	by running <code>stress -c 2</code>. </li>
 575
 576         <li> Repeat the above for the cgroup-v2 CPU controller. Hint: <code>
 577         echo 1000000 1000000 > cpu.max</code>. </li>
 578
	<li> In a cgroup with one bash process, start a simple loop that prints
	some output: <code> while :; do date; sleep 1; done</code>. Freeze
	the cgroup by writing the string <code> FROZEN </code>
	to a suitable <code> freezer.state </code> file in the cgroup-v1 file
	system. Then unfreeze the cgroup by writing <code> THAWED </code>
	to the same file. Find out how one can tell whether a given cgroup
	is frozen. </li>
 586
 587         <li> Pick a block device to throttle. Estimate its maximal read
 588         bandwidth by running a command like <code> ddrescue /dev/sdX
 589         /dev/null</code>.  Enforce a read bandwidth rate of 1M/s for the
 590         device by writing a string of the form <code> "$MAJOR:$MINOR $((1024 *
 591         1024))" </code> to a file named <code> blkio.throttle.read_bps_device
 592         </code> in the cgroup-v1 pseudo file system. Check that the bandwidth
 593         was indeed throttled by running the above <code> ddrescue </code>
 594         command again. </li>
 595
	<li> Repeat the previous exercise, but this time use the cgroup-v2
	interface for the I/O controller. Hint: write a string of the form
	<code> "$MAJOR:$MINOR rbps=$((1024 * 1024))" </code> to a file named
	<code>io.max</code>. </li>
 600
 601 </ul>
 602
 603 HOMEWORK(«
 604 <ul>
 605
 606         <li> In one terminal running <code> bash</code>, start a second <code>
 607         bash </code> process and print its PID with <code> echo $$</code>.
 608         Guess what happens if you run <code> kill -STOP $PID; kill -CONT
 609         $PID</code> from a second terminal, where <code> $PID </code>
 610         is the PID that was printed in the first terminal. Try it out,
 611         explain the observed behaviour and discuss its impact on the freezer
 612         controller. Repeat the experiment but this time use the freezer
 613         controller to stop and restart the bash process. </li>
 614 </ul>
 615
 616 »)
 617
 618 SECTION(«Linux Containers»)
 619
<p> Containers provide resource management through control groups and
resource isolation through namespaces. A <em> container platform </em>
is thus a software layer implemented on top of these features. Given a
directory containing a Linux root file system, starting the container
is a simple matter: First <code> clone(2) </code> is called with the
proper <code> CLONE_NEW* </code> flags to create a new process in a
suitable set of namespaces. The child process then creates a cgroup
for the container and puts itself into it. The final step is to let
the child process hand over control to the container's <code>
/sbin/init </code> by calling <code> execve(2)</code>. When the last
process in the newly created namespaces exits, the namespaces disappear
and the parent process removes the cgroup. The details are a bit more
complicated, but the above covers the essence of what the container
startup command has to do. </p>
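
<p> The steps above can be sketched in C as follows. This is a minimal
illustration under several assumptions, not the code of any particular
container platform: the selection of namespaces, the use of
<code>chroot(2)</code> to switch to the root file system, the
cgroup-v2 directory layout, and all names such as <code>struct
container</code> and <code>run_container()</code> are chosen for this
example only. Running it requires root privileges. </p>

```c
#define _GNU_SOURCE
#include <limits.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Namespaces to create for the container (an assumed selection). */
#define CONTAINER_NS_FLAGS (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC \
	| CLONE_NEWPID | CLONE_NEWNET)

struct container {
	const char *root;   /* directory containing the root file system */
	const char *cgroup; /* cgroup-v2 directory to create on the host */
};

/* Create the cgroup directory and move the calling process into it. */
static int join_cgroup(const char *dir)
{
	char path[PATH_MAX];
	FILE *f;

	if (mkdir(dir, 0755) != 0)
		return -1;
	snprintf(path, sizeof(path), "%s/cgroup.procs", dir);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", (int)getpid());
	return fclose(f) == 0 ? 0 : -1;
}

/* Runs as the first process in the new namespaces. */
static int container_init(void *arg)
{
	const struct container *c = arg;
	char *const argv[] = {"/sbin/init", NULL};
	char *const envp[] = {NULL};

	if (join_cgroup(c->cgroup) != 0)
		return EXIT_FAILURE;
	if (chroot(c->root) != 0 || chdir("/") != 0)
		return EXIT_FAILURE;
	execve(argv[0], argv, envp); /* returns only on errors */
	return EXIT_FAILURE;
}

#define STACK_SIZE (64 * 1024)
static char child_stack[STACK_SIZE];

/*
 * Start the container, wait until its init process exits, then remove
 * the cgroup. Returns the wait status of init, or -1 on errors.
 */
static int run_container(struct container *c)
{
	int status;
	pid_t pid = clone(container_init, child_stack + STACK_SIZE,
		CONTAINER_NS_FLAGS | SIGCHLD, c);

	if (pid == -1)
		return -1;
	if (waitpid(pid, &status, 0) == -1)
		return -1;
	rmdir(c->cgroup); /* fails harmlessly if the cgroup is gone */
	return status;
}
```

<p> Passing <code>SIGCHLD</code> along with the namespace flags lets
the parent wait for the child with <code>waitpid(2)</code>, which is
how it learns when to remove the cgroup. </p>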
 634
 635 <p> Many container platforms offer additional features not to
 636 be discussed here, like downloading and unpacking a file system
 637 image from the internet, or supplying the root file system
 638 for the container by other means, for example by creating an
 639 LVM snapshot of a master image. In this section we look at <a
 640 href="http://people.tuebingen.mpg.de/maan/micoforia/">micoforia</a>,
 641 a minimalistic container platform to boot a container from an existing
 642 root file system as described above. </p>
 643
 644 <p> The containers known to micoforia are defined in the single
 645 <code>~/.micoforiarc</code> configuration file whose format is
 646 described in <code>micoforia(8)</code>. The <code>micoforia</code>
 647 command supports various subcommands to maintain containers. For
 648 example, containers are started with a command such as <code>micoforia
 649 start c1</code> where <code>c1</code> is the name of the
 650 container. One can execute a shell running within the container
 651 with <code>micoforia enter c1</code>, log in to a local pseudo
 652 terminal with <code>micoforia attach c1</code>, or connect via ssh
 653 with <code>ssh c1</code>. Of course the latter command only works
 654 if the network interface and the DNS record get configured during
 655 container startup and the sshd package is installed. The container can
 656 be stopped by executing <code>halt</code> from within the container,
 657 or by running <code>micoforia stop c1</code> on the host system. The
 658 commands <code>micoforia ls</code> and <code>micoforia ps</code>
 659 print information about containers and their processes. </p>
 660
 661 <p> The exercises ask the reader to install the micoforia package from
 662 source, and to set up a minimal container running Ubuntu Linux. </p>
 663
 664 EXERCISES()
 665
 666 <ul>
 667         <li> Clone the micoforia git repository from
 668         <code>git://git.tuebingen.mpg.de/micoforia</code> and compile the
 669         source code with <code>./configure && make</code>. Install with
 670         <code>make install</code>. </li>
 671
 672         <li> Download a minimal Ubuntu root file system
 673         with a command like <code>debootstrap --download-only
 674         --include isc-dhcp-client bionic /var/lib/micoforia/c1/
 675         http://de.archive.ubuntu.com/ubuntu</code>. </li>
 676
 677         <li> Set up an ethernet bridge as described
 678         in <code>micoforia(8)</code>. Consult the <a
 679         href="./Networking.html#link_layer">Link Layer</a> section of the
 680         chapter on networking if you would like to understand what you are
 681         doing. </li>
 682
	<li> Add the following lines to <code>/etc/fstab</code> to configure
	the cgroup filesystems, create the corresponding directories, and run
	<code>mount -a</code> to mount them.

	<pre>
		none /var/cgroup cgroup devices 0 0
		none /var/cgroup2 cgroup2 defaults 0 0
	</pre>
	</li>
 691
 692         <li> Define a minimal container named <code>c1</code> with <code>echo
 693         container c1 > ~/.micoforiarc</code>. This container will have a
 694         single network device and neither CPU nor memory isolation will be
 695         enforced for the processes of the container. </li>
 696
 697         <li> Start the container in foreground mode with <code>micoforia --
 698         start -F c1</code>. </li>
 699
	<li> Run <code>micoforia stop c1</code> to stop the container,
	edit <code>~/.micoforiarc</code> and add the following two lines to
	configure memory and CPU limits:
 703
 704         <pre>
 705                 cpu-cores c1:1
 706                 memory-limit c1:1
 707         </pre>
 708
 709         Start the container and run suitable commands which show that the
 710         newly configured limits are in effect. </li>
 711
 712         <li> While the container is running, investigate the control files
 713         of the cgroup pseudo file systems. Identify the pseudo files which
 714         control the CPU and memory limit of the container. </li>
 715
 716         <li> Come up with suitable commands to change the CPU and memory
 717         limits of the container while it is running. </li>
 718
 719         <li> On the host system, create a loop device and a file system on
 720         it. Mount the file system on a subdirectory of the root file system
 721         of the container. Note that the mount is not visible from within the
 722         container. Come up with a way to make such mounts visible without
 723         restarting the container. </li>
 724 </ul>
 725
 726 SUPPLEMENTS()
 727
 728 SUBSECTION(«UTS Namespace Example»)
 729 <pre>
 730         <code>
		#define _GNU_SOURCE
		#include &lt;sys/utsname.h&gt;
		#include &lt;sched.h&gt;
		#include &lt;stdio.h&gt;
		#include &lt;stdlib.h&gt;
		#include &lt;unistd.h&gt;

		/* Print the hostname of the caller's UTS namespace, then exit. */
		static void print_hostname_and_exit(const char *pfx)
		{
			struct utsname uts;

			if (uname(&amp;uts) == -1) {
				perror("uname");
				exit(EXIT_FAILURE);
			}
			printf("%s: %s\n", pfx, uts.nodename);
			exit(EXIT_SUCCESS);
		}

		/* Runs in the new UTS namespace, the parent's hostname is unaffected. */
		static int child(void *arg)
		{
			if (sethostname("jesus", 5) == -1)
				perror("sethostname");
			print_hostname_and_exit("child");
			return EXIT_SUCCESS; /* not reached */
		}

		#define STACK_SIZE (64 * 1024)
		static char child_stack[STACK_SIZE];

		int main(int argc, char *argv[])
		{
			/* The stack grows downwards, so pass the end of the buffer. */
			if (clone(child, child_stack + STACK_SIZE, CLONE_NEWUTS, NULL) == -1) {
				perror("clone"); /* CLONE_NEWUTS requires CAP_SYS_ADMIN */
				exit(EXIT_FAILURE);
			}
			print_hostname_and_exit("parent");
		}
 761         </code>
 762 </pre>
 763
 764 SUBSECTION(«PID Namespace Example»)
 765 <pre>
 766         <code>
		#define _GNU_SOURCE
		#include &lt;sched.h&gt;
		#include &lt;unistd.h&gt;
		#include &lt;stdlib.h&gt;
		#include &lt;stdio.h&gt;

		/* The first process in a new PID namespace sees itself as PID 1. */
		static int child(void *arg)
		{
			printf("PID: %d, PPID: %d\n", (int)getpid(), (int)getppid());
			/* exit(3) flushes the stdio buffers, simply returning may not. */
			exit(EXIT_SUCCESS);
		}

		#define STACK_SIZE (64 * 1024)
		static char child_stack[STACK_SIZE];

		int main(int argc, char *argv[])
		{
			/* CLONE_NEWPID requires CAP_SYS_ADMIN, so run this as root. */
			pid_t pid = clone(child, child_stack + STACK_SIZE, CLONE_NEWPID, NULL);

			if (pid == -1) {
				perror("clone");
				exit(EXIT_FAILURE);
			}
			/* This is the PID of the child in the parent's PID namespace. */
			printf("child PID: %d\n", (int)pid);
			exit(EXIT_SUCCESS);
		}
 787         </code>
 788 </pre>
 789
 790 SECTION(«Further Reading»)
 791 <ul>
	<li> <a href="https://lwn.net/Articles/782876/">The creation of the
	io.latency block I/O controller</a>, by Josef Bacik </li>
 794 </ul>