TITLE(« Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it. -- Perlis's Programming Proverb #58 (1982) », __file__) OVERVIEW(« In general, virtualization refers to the abstraction of computer resources. This chapter is primarily concerned with server virtualization, a concept which makes it possible to run more than one operating system simultaneously and independently of each other on a single physical computer. We first describe the different virtualization frameworks but quickly specialize in Linux OS-level virtualization and its virtual machines, called containers. Container platforms for Linux are built on top of namespaces and control groups, the low-level kernel features which implement abstraction and isolation of processes. We look at both concepts in some detail. The final section discusses micoforia, a minimal container platform. ») SECTION(«Virtualization Frameworks») The origins of server virtualization date back to the 1960s. The first virtual machine was created as a collaboration between IBM (International Business Machines) and MIT (the Massachusetts Institute of Technology). Since then, many different approaches have been designed, resulting in several virtualization frameworks. All frameworks promise to improve resource utilization and availability, to reduce costs, and to provide greater flexibility. While some of these benefits might be real, they do not come for free. The costs include a new single point of failure (the host), decreased performance, and added complexity and maintenance effort for debugging, documenting and maintaining the VMs. This chapter briefly describes the three main virtualization frameworks. We list the advantages and disadvantages of each and give some examples. SUBSECTION(«Software Virtualization (Emulation)») This virtualization framework does not play a significant role in server virtualization; it is only included for completeness. Emulation means to imitate a complete hardware architecture in software, including peripheral devices. All CPU instructions and hardware interrupts are interpreted by the emulator rather than being run by native hardware. Since this approach incurs a large performance penalty, it is only suitable when speed is not critical. For this reason, emulation is typically employed for ancient hardware like arcade game systems and home computers such as the Commodore 64. Despite the performance penalty, emulation is valuable because it allows applications and operating systems to run on the current platform as they did in their original environment. Examples: Bochs, Mame, VICE. SUBSECTION(«Paravirtualization and Hardware-Assisted Virtualization») These virtualization frameworks are characterized by the presence of a hypervisor, also known as a Virtual Machine Monitor, which translates system calls from the VMs to native hardware requests. In contrast to Software Virtualization, the host OS does not emulate hardware resources but offers special APIs to the VMs. If the presented interface differs from that of the underlying hardware, the term paravirtualization is used. The guest OS then has to be modified to include paravirtualized drivers. In 2005, AMD and Intel added hardware virtualization instructions to their CPUs and IOMMUs (Input/Output Memory Management Units) to their chipsets. This allowed VMs to directly execute privileged instructions and use peripheral devices.
This so-called Hardware-Assisted Virtualization allows unmodified operating systems to run on the VMs. The main advantage of Hardware-Assisted Virtualization is its flexibility, as the host OS does not need to match the OS running on the VMs. The disadvantages are hardware compatibility constraints and performance loss. Although virtually all current hardware has virtualization support, there are still significant performance differences between the host and the VM. Moreover, peripheral devices like storage hardware have to be compatible with the chipset to make use of the IOMMU. Examples: KVM (with QEMU as hypervisor), Xen, UML. SUBSECTION(«OS-level Virtualization (Containers)») OS-level Virtualization is a lightweight virtualization technique. The abstractions are built directly into the kernel and no hypervisor is needed. In this context the term "virtual machine" is inaccurate, which is why OS-level VMs go by a different name. On Linux they are called containers; other operating systems call them jails or zones. We shall exclusively use "container" from now on. All containers share a single kernel, so the OS running in the container has to match the host OS. However, each container has its own root file system, so containers can differ in user space. For example, different containers can run different Linux distributions. Since programs running in a container use the normal system call interface to communicate with the kernel, OS-level Virtualization does not require hardware support for efficient performance. In fact, OS-level Virtualization imposes essentially no overhead. OS-level Virtualization is superior to the alternatives because of its simplicity and its performance. The only disadvantage is the lack of flexibility: it is simply not an option if some of the VMs must run an operating system different from the host's. Examples: LXC, Micoforia, Singularity, Docker. EXERCISES() HOMEWORK(« ») SECTION(«Namespaces») Namespaces partition the set of processes into disjoint subsets with local scope. Where traditional Unix systems provided only a single system-wide instance of each resource, shared by all processes, the namespace abstractions make it possible to give processes the illusion of living in their own isolated instance. Linux implements the following six different types of namespaces: mount (Linux-2.4.x, 2002), IPC (Linux-2.6.19, 2006), UTS (Linux-2.6.19, 2006), PID (Linux-2.6.24, 2008), network (Linux-2.6.29, 2009), user (Linux-3.8, 2013). For OS-level virtualization, all six namespace types are typically employed to make the containers look like independent systems. Before we look at each namespace type, we briefly describe how namespaces are created and how information related to namespaces can be obtained for a process. SUBSECTION(«Namespace API»)

Initially, there is only a single namespace of each type, called the root namespace, and all processes belong to it. The clone(2) system call is a generalization of the classic fork(2) which allows privileged users to create new namespaces by passing one or more of the six CLONE_NEW* flags. The child process is made a member of the new namespace(s). Calling plain fork(2), or clone(2) with no CLONE_NEW* flag, lets the newly created process inherit the namespaces from its parent. There are two additional system calls, setns(2) and unshare(2), which both change the namespace(s) of the calling process without creating a new process. For the latter there is also a user command, unshare(1), which makes the namespace API available to scripts.
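
To illustrate the API, here is a minimal sketch (not part of the supplements below) which uses unshare(2) to detach the calling process from its current UTS namespace, changes the hostname in the new namespace, and starts a shell so that the effect can be inspected with hostname(1). The hostname "sandbox" is an arbitrary example, and the program only succeeds if it is run with the CAP_SYS_ADMIN capability.

	
		#define _GNU_SOURCE
		#include <sched.h>
		#include <stdio.h>
		#include <stdlib.h>
		#include <unistd.h>

		int main(void)
		{
			/* Leave the current UTS namespace, requires CAP_SYS_ADMIN. */
			if (unshare(CLONE_NEWUTS) < 0) {
				perror("unshare");
				exit(EXIT_FAILURE);
			}
			/* Affects only the new UTS namespace. */
			if (sethostname("sandbox", 7) < 0) {
				perror("sethostname");
				exit(EXIT_FAILURE);
			}
			/* Run a shell in which hostname(1) shows the new name. */
			execl("/bin/sh", "sh", (char *)NULL);
			perror("execl");
			exit(EXIT_FAILURE);
		}
	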

The /proc/$PID directory of each process contains a ns subdirectory with one (pseudo-)file per namespace type. The inode number of this file is the namespace ID. Hence, by running stat(1) one can tell whether two different processes belong to the same namespace. Normally a namespace ceases to exist when the last process in the namespace terminates. However, by opening /proc/$PID/ns/$TYPE and keeping the resulting file descriptor open, one can prevent the namespace from disappearing.
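
The sketch below performs this comparison programmatically: it runs stat(2) on the uts entry in the ns directories of two given processes and compares the inode numbers. The choice of the UTS namespace is arbitrary; the other namespace types work the same way. Inspecting processes owned by other users may require additional privileges.

	
		#include <sys/types.h>
		#include <sys/stat.h>
		#include <stdio.h>
		#include <stdlib.h>

		int main(int argc, char *argv[])
		{
			char path1[64], path2[64];
			struct stat st1, st2;

			if (argc != 3) {
				fprintf(stderr, "usage: %s pid1 pid2\n", argv[0]);
				exit(EXIT_FAILURE);
			}
			snprintf(path1, sizeof(path1), "/proc/%s/ns/uts", argv[1]);
			snprintf(path2, sizeof(path2), "/proc/%s/ns/uts", argv[2]);
			if (stat(path1, &st1) < 0 || stat(path2, &st2) < 0) {
				perror("stat");
				exit(EXIT_FAILURE);
			}
			/* Equal inode numbers mean both processes share the namespace. */
			printf("%s UTS namespace\n",
				st1.st_ino == st2.st_ino ? "same" : "different");
			exit(EXIT_SUCCESS);
		}
	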

SUBSECTION(«UTS Namespaces») UTS is short for UNIX Time-sharing System. The old-fashioned term "time-sharing" has been replaced by multitasking, but the old name lives on in the uname(2) system call, which fills out the fields of a struct utsname. On return, the nodename field of this structure contains the hostname which was set by a previous call to sethostname(2). Similarly, the domainname field contains the string that was set with setdomainname(2). UTS namespaces provide isolation of these two system identifiers. That is, processes in different UTS namespaces may see different hostnames and domain names. Changing the hostname or domain name affects only processes which belong to the same UTS namespace as the process which called sethostname(2) or setdomainname(2). SUBSECTION(«Mount Namespaces») Mount namespaces are the oldest Linux namespace type. This is natural, since they were designed to overcome well-known limitations of the venerable chroot(2) system call, which was introduced in 1979. Mount namespaces isolate the mount points seen by processes so that processes in different mount namespaces can have different views of the file system hierarchy. As with other namespace types, new mount namespaces are created by calling clone(2) or unshare(2). The new mount namespace starts out with a copy of the caller's mount point list. However, with more than one mount namespace the mount(2) and umount(2) system calls no longer operate on a global set of mount points. Whether or not a mount or unmount operation has an effect on processes in mount namespaces other than the caller's is determined by the configurable mount propagation rules. By default, modifications to the list of mount points affect only the processes which are in the same mount namespace as the process which initiated the modification. This setting is controlled by the propagation type of the mount point. Besides the obvious private and shared types, there is also the MS_SLAVE propagation type which lets mount and unmount events propagate from a "master" to its "slaves" but not the other way round. SUBSECTION(«Network Namespaces») Network namespaces not only partition the set of processes, as all six namespace types do, but also the set of network interfaces. That is, each physical or virtual network interface belongs to one (and only one) network namespace. Initially, all interfaces are in the root network namespace. This can be changed with the command ip link set iface netns PID. Processes only see interfaces whose network namespace matches the one they belong to. This lets processes in different network namespaces have different ideas about which network devices exist. Each network namespace has its own IP stack, IP routing table and TCP and UDP ports. This makes it possible to start, for example, many sshd(8) processes which all listen on "their own" TCP port 22. An OS-level virtualization framework typically leaves physical interfaces in the root network namespace but creates a dedicated network namespace and a virtual interface pair for each container. One end of the pair is left in the root namespace while the other end is configured to belong to the dedicated namespace, which contains all processes of the container. SUBSECTION(«PID Namespaces») This namespace type allows a process to have more than one process ID. Unlike network interfaces, which disappear from their previous namespace when they are moved to a different network namespace, a process remains visible in the root namespace after it has entered a different PID namespace.
Besides its existing PID it gets a second PID which is only valid inside the target namespace. Similarly, when a new PID namespace is created by passing the CLONE_NEWPID flag to clone(2), the child process gets some unused PID in the original PID namespace but PID 1 in the new namespace. As a consequence, processes in different PID namespaces can have the same PID. In particular, there can be arbitrarily many "init" processes, which all have PID 1. The usual rules for PID 1 apply within each PID namespace. That is, orphaned processes are reparented to the init process, and it is a fatal error if the init process terminates, causing all processes in the namespace to terminate as well. PID namespaces can be nested, but under normal circumstances they are not, so we won't discuss nesting. Since each process in a non-root PID namespace also has a PID in the root PID namespace, processes in the root PID namespace can "see" all processes but not vice versa. Hence a process in the root namespace can send signals to all processes while processes in a child namespace can only send signals to processes in their own namespace. Processes can be moved from the root PID namespace into a child PID namespace but not the other way round. Moreover, a process can instruct the kernel to create subsequent child processes in a different PID namespace. SUBSECTION(«User Namespaces») User namespaces were implemented rather late compared to the other namespace types; the implementation was completed in 2013. The purpose of user namespaces is to isolate user and group IDs. Initially there is only one user namespace, the initial namespace to which all processes belong. As with all namespace types, a new user namespace is created with unshare(2) or clone(2). The UID and GID of a process can be different in different namespaces. In particular, an unprivileged process may have UID 0 inside a user namespace. When a process is created in a new user namespace or joins an existing one, it gains full privileges in that namespace. However, the process has no additional privileges in the parent/previous namespace. Moreover, a certain flag is set for the process which prevents it from entering yet another namespace with elevated privileges. In particular, it does not keep its privileges when it returns to its original namespace. User namespaces can be nested, but we don't discuss nesting here. Each user namespace has an owner, which is the effective user ID (EUID) of the process which created the namespace. Any process in the root user namespace whose EUID matches the owner ID has all capabilities in the child namespace. If CLONE_NEWUSER is specified together with other CLONE_NEW* flags in a single clone(2) or unshare(2) call, the user namespace is guaranteed to be created first, giving the child/caller privileges over the remaining namespaces created by the call. It is possible to map UIDs and GIDs between namespaces. The /proc/$PID/uid_map and /proc/$PID/gid_map files are used to get and set the mappings. We will only talk about UID mappings in the sequel because the mechanism for GID mappings is analogous. When the /proc/$PID/uid_map (pseudo-)file is read, the contents are computed on the fly and depend on both the user namespace to which process $PID belongs and the user namespace of the calling process. Each line contains three numbers which specify the mapping for a range of UIDs.
The numbers have to be interpreted in one of two ways, depending on whether the two processes belong to the same user namespace or not. All system calls which deal with UIDs transparently translate UIDs by consulting these maps. A map for a newly created namespace is established by writing UID triples once to one uid_map file. Subsequent writes will fail. SUBSECTION(«IPC Namespaces») System V inter-process communication (IPC) comprises three different mechanisms which enable unrelated processes to communicate with each other. These mechanisms, known as message queues, semaphores and shared memory, predate Linux by at least a decade. They are mandated by the POSIX standard, so every Unix system has to implement the prescribed API. The common characteristic of the System V IPC mechanisms is that their objects are addressed by system-wide IPC identifiers rather than by pathnames. IPC namespaces isolate these resources so that processes in different IPC namespaces have different views of the existing IPC identifiers. When a new IPC namespace is created, it starts out with all three identifier sets empty. Newly created IPC objects are only visible to processes which belong to the same IPC namespace as the process which created the object. EXERCISES() HOMEWORK(« The shmctl(2) system call performs operations on a System V shared memory segment. It operates on a shmid_ds structure which contains in the shm_lpid field the PID of the process which last attached or detached the segment. Describe the implications this API detail has on the interaction between IPC and PID namespaces. ») SECTION(«Control Groups») Control groups (cgroups) allow processes to be grouped and organized hierarchically in a tree. Each control group contains processes which can be monitored or controlled as a unit, for example by limiting the resources they can occupy. Several controllers exist (CPU, memory, I/O, etc.), some of which actually impose control while others only provide identification and relay control to separate mechanisms. Unfortunately, control groups are not easy to understand because the controllers are implemented in an inconsistent way and because of the rather chaotic relationship between them. In 2014 it was decided to rework the cgroup subsystem of the Linux kernel. To keep existing applications working, the original cgroup implementation, now called cgroup-v1, was retained and a second, incompatible, cgroup implementation was designed. Cgroup-v2 aims to address the shortcomings of the first version, including its inefficiency, inconsistency and the lack of interoperability among controllers. The cgroup-v2 API was made official in 2016. Version 1 continues to work even if both implementations are active. Both cgroup implementations provide a pseudo file system that must be mounted in order to define and configure cgroups. The two pseudo file systems may be mounted at the same time (on different mount points). For both cgroup versions, the standard mkdir(2) system call creates a new cgroup. To add a process to a cgroup one must write its PID to one of the files in the pseudo file system. We will cover both cgroup versions because as of 2018-11 many applications still rely on cgroup-v1 and cgroup-v2 still lacks some of the functionality of cgroup-v1. However, we will not look at all controllers. SUBSECTION(«CPU controllers») These controllers regulate the distribution of CPU cycles.
The cpuset controller of cgroup-v1 is the oldest cgroup controller; it was implemented before the cgroup-v1 subsystem existed, which is why it provides its own pseudo file system which is usually mounted at /dev/cpuset. This file system is only kept for backward compatibility and is otherwise equivalent to the corresponding part of the cgroup pseudo file system. The cpuset controller links subsets of CPUs to cgroups so that the processes in a cgroup are confined to run only on the CPUs of "their" subset. The CPU controller of cgroup-v2, which is simply called "cpu", works differently. Instead of specifying the set of admissible CPUs for a cgroup, one defines the ratio of CPU cycles for the cgroup. Work to support CPU partitioning as in the cpuset controller of cgroup-v1 is in progress and expected to be ready in 2019. SUBSECTION(«Devices») The device controller of cgroup-v1 imposes mandatory access control for device-special files. It tracks the open(2) and mknod(2) system calls and enforces the restrictions defined in the device access whitelist of the cgroup the calling process belongs to. Processes in the root cgroup have full permissions. Other cgroups inherit the device permissions from their parent. A child cgroup never has more permissions than its parent. Cgroup-v2 takes a completely different approach to device access control. It is implemented on top of BPF, the Berkeley Packet Filter. Hence this controller is not listed in the cgroup-v2 pseudo file system. SUBSECTION(«Freezer») Both cgroup-v1 and cgroup-v2 implement a freezer controller, which provides the ability to stop ("freeze") all processes in a cgroup to free up resources for other tasks. The stopped processes can be continued ("thawed") as a unit later. This is similar to sending SIGSTOP/SIGCONT to all processes, but avoids some problems with corner cases. The v2 version was added in 2019-07. It is available from Linux-5.2 onwards. SUBSECTION(«Memory») Cgroup-v1 offers three controllers related to memory management. First there is the cpuset controller described above, which can be instructed to let processes allocate only memory which is close to the CPUs of the cpuset. This makes sense on NUMA (non-uniform memory access) systems where the memory access time for a given CPU depends on the memory location. Second, the hugetlb controller manages distribution and usage of huge pages. Third, there is the memory resource controller which provides a number of files in the cgroup pseudo file system to limit process memory usage, swap usage and the usage of memory by the kernel on behalf of the process. The most important tunable of the memory resource controller is memory.limit_in_bytes. The cgroup-v2 version of the memory controller is rather more complex because it attempts to limit direct and indirect memory usage of the processes in a cgroup in a bullet-proof way. It is designed to restrain even malicious processes which try to slow down or crash the system by indirectly allocating memory. For example, a process could try to create many threads or file descriptors which all cause a (small) memory allocation in the kernel. Besides several tunables and statistics, the memory controller provides the memory.events file whose contents change whenever a state transition for the cgroup occurs, for example when processes start to get throttled because the high memory boundary was exceeded. This file could be monitored by a management agent to take appropriate actions. The main mechanism to control memory usage is the memory.high file.
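
As an illustration, the following sketch creates a cgroup with mkdir(2), sets memory.high, and then moves the calling process into the new cgroup by writing its PID to the cgroup.procs pseudo file, so that the shell it starts afterwards is subject to the limit. The mount point /sys/fs/cgroup and the cgroup name "demo" are assumptions; the sketch further assumes that the memory controller is enabled in the parent's cgroup.subtree_control and that it is run as root.

	
		#include <sys/stat.h>
		#include <sys/types.h>
		#include <stdio.h>
		#include <stdlib.h>
		#include <unistd.h>

		/* Assumed location of the cgroup-v2 pseudo file system. */
		#define CG "/sys/fs/cgroup/demo"

		static void write_file(const char *path, const char *buf)
		{
			FILE *f = fopen(path, "w");

			if (!f || fputs(buf, f) == EOF || fclose(f) == EOF) {
				perror(path);
				exit(EXIT_FAILURE);
			}
		}

		int main(void)
		{
			char pid[32];

			/* mkdir(2) on the pseudo file system creates the cgroup. */
			if (mkdir(CG, 0755) < 0) {
				perror("mkdir");
				exit(EXIT_FAILURE);
			}
			/* Start throttling the group beyond 100MiB of memory. */
			write_file(CG "/memory.high", "104857600");
			/* Writing a PID to cgroup.procs moves that process. */
			snprintf(pid, sizeof(pid), "%d\n", (int)getpid());
			write_file(CG "/cgroup.procs", pid);
			/* The shell and its children are charged to the new cgroup. */
			execl("/bin/sh", "sh", (char *)NULL);
			perror("execl");
			exit(EXIT_FAILURE);
		}
	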
SUBSECTION(«I/O») I/O controllers regulate the distribution of I/O resources among cgroups. The throttling policy of cgroup-v2 can be used to enforce I/O rate limits on arbitrary block devices, for example on a logical volume provided by the logical volume manager (LVM). Read and write bandwidth may be throttled independently. Moreover, the number of IOPS (I/O operations per second) may also be throttled. The I/O controller of cgroup-v1 is called blkio while for cgroup-v2 it is simply called io. The features of the v1 and v2 I/O controllers are identical but the filenames of the pseudo files and the syntax for setting I/O limits differ. The exercises ask the reader to try out both versions. There is no cgroup-v2 controller for multi-queue schedulers so far. However, there is the I/O Latency controller for cgroup-v2 which works for arbitrary block devices and all I/O schedulers. It features I/O workload protection for the processes in a cgroup. This works by throttling the processes in cgroups that have a lower latency target than those in the protected cgroup. The throttling is performed by lowering the depth of the request queue of the affected devices. EXERCISES() HOMEWORK(« ») SECTION(«Linux Containers»)

Containers provide resource management through control groups and resource isolation through namespaces. A container platform is thus a software layer implemented on top of these features. Given a directory containing a Linux root file system, starting the container is a simple matter: First, clone(2) is called with the proper CLONE_NEW* flags to create a new process in a suitable set of namespaces. The child process then creates a cgroup for the container, puts itself into it, and makes the given directory its root file system, for example with chroot(2) or pivot_root(2). The final step is to let the child process hand over control to the container's /sbin/init by calling exec(2). When the last process in the newly created namespaces exits, the namespaces disappear and the parent process removes the cgroup. The details are a bit more complicated, but the above covers the essence of what the container startup command has to do.
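
The sketch below illustrates these steps in C. It is not how micoforia actually implements container startup: the path /srv/container for the prepared root file system and the hostname "c1" are made-up examples, and cgroup setup, network configuration and most error handling are omitted. The program has to be run as root.

	
		#define _GNU_SOURCE
		#include <sched.h>
		#include <signal.h>
		#include <sys/mount.h>
		#include <sys/wait.h>
		#include <stdio.h>
		#include <stdlib.h>
		#include <unistd.h>

		/* Hypothetical path to a prepared root file system. */
		#define ROOTFS "/srv/container"

		static int child(void *arg)
		{
			/* Affects only the new UTS namespace. */
			sethostname("c1", 2);
			/* Keep mount events local to the new mount namespace. */
			mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
			/* Switch to the container's root file system. */
			if (chroot(ROOTFS) < 0 || chdir("/") < 0) {
				perror("chroot");
				return EXIT_FAILURE;
			}
			/* Mount a proc instance for the new PID namespace. */
			mount("proc", "/proc", "proc", 0, NULL);
			/* Hand over control to the container's init. */
			execl("/sbin/init", "init", (char *)NULL);
			perror("execl");
			return EXIT_FAILURE;
		}

		#define STACK_SIZE (1024 * 1024)
		static char child_stack[STACK_SIZE];

		int main(void)
		{
			int flags = CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC
				| CLONE_NEWPID | CLONE_NEWNET | SIGCHLD;
			pid_t pid = clone(child, child_stack + STACK_SIZE, flags, NULL);

			if (pid < 0) {
				perror("clone");
				exit(EXIT_FAILURE);
			}
			/* Wait until the container's init process terminates. */
			waitpid(pid, NULL, 0);
			exit(EXIT_SUCCESS);
		}
	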

Many container platforms offer additional features not to be discussed here, like downloading and unpacking a file system image from the internet, or supplying the root file system for the container by other means, for example by creating an LVM snapshot of a master image. In this section we look at micoforia, a minimalistic container platform which boots a container from an existing root file system as described above.

The containers known to micoforia are defined in a single configuration file, ~/.micoforiarc, whose format is described in micoforia(8). The micoforia command supports various subcommands to maintain containers. For example, containers are started with a command such as micoforia start c1, where c1 is the name of the container. One can execute a shell running within the container with micoforia enter c1, log in to a local pseudo terminal with micoforia attach c1, or connect via ssh with ssh c1. Of course, the latter command only works if the network interface and the DNS record are configured during container startup and the sshd package is installed in the container. The container can be stopped by executing halt from within the container, or by running micoforia stop c1 on the host system. The commands micoforia ls and micoforia ps print information about containers and their processes.

The exercises ask the reader to install the micoforia package from source, and to set up a minimal container running Ubuntu Linux.

EXERCISES() SUPPLEMENTS() SUBSECTION(«UTS Namespace Example»)
	
		#define _GNU_SOURCE
		#include <sys/utsname.h>
		#include <sched.h>
		#include <stdio.h>
		#include <stdlib.h>
		#include <unistd.h>

		static void print_hostname_and_exit(const char *pfx)
		{
			struct utsname uts;

			uname(&uts);
			printf("%s: %s\n", pfx, uts.nodename);
			exit(EXIT_SUCCESS);
		}

		static int child(void *arg)
		{
			/* Affects only the new UTS namespace, not the parent's. */
			sethostname("jesus", 5);
			print_hostname_and_exit("child");
			return 0; /* not reached */
		}

		#define STACK_SIZE (64 * 1024)
		static char child_stack[STACK_SIZE];

		int main(int argc, char *argv[])
		{
			/* Creating a UTS namespace requires CAP_SYS_ADMIN. */
			if (clone(child, child_stack + STACK_SIZE, CLONE_NEWUTS, NULL) < 0) {
				perror("clone");
				exit(EXIT_FAILURE);
			}
			print_hostname_and_exit("parent");
		}
	
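
When the program is run with the CAP_SYS_ADMIN capability, the parent prints the unchanged hostname of the host while the child prints "jesus", the name it has just set in its private UTS namespace. Since parent and child run concurrently, the order of the two output lines is not deterministic.
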
SUBSECTION(«PID Namespace Example»)
	
		#define _GNU_SOURCE
		#include <sched.h>
		#include <unistd.h>
		#include <stdlib.h>
		#include <stdio.h>

		static int child(void *arg)
		{
			/* Prints 1 and 0: the parent is in another PID namespace
			 * and hence not visible from here. */
			printf("PID: %d, PPID: %d\n", (int)getpid(), (int)getppid());
			return 0;
		}

		#define STACK_SIZE (64 * 1024)
		static char child_stack[STACK_SIZE];

		int main(int argc, char *argv[])
		{
			/* Creating a PID namespace requires CAP_SYS_ADMIN. */
			pid_t pid = clone(child, child_stack + STACK_SIZE, CLONE_NEWPID, NULL);

			if (pid < 0) {
				perror("clone");
				exit(EXIT_FAILURE);
			}
			printf("child PID: %d\n", (int)pid);
			exit(EXIT_SUCCESS);
		}
	
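
If the clone(2) call succeeds, the child reports PID 1 and PPID 0: it is the init process of the new PID namespace, and its parent is not visible from within that namespace. The parent, in contrast, prints the PID which the child received in the root PID namespace.
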
SECTION(«Further Reading»)