TITLE(«
Fools ignore complexity. Pragmatists suffer it. Some can avoid it.
Geniuses remove it. -- Perlis's Programming Proverb #58 (1982)
», __file__)


OVERVIEW(«

In general, virtualization refers to the abstraction of computer
resources. This chapter is primarily concerned with <em> server
virtualization</em>, a concept which makes it possible to run
more than one operating system simultaneously and independently
of each other on a single physical computer.  We first describe
the different virtualization frameworks but quickly specialize on
Linux OS-level virtualization and their virtual machines called <em>
containers</em>. Container platforms for Linux are built on top of
<em>namespaces</em> and <em>control groups</em>, the low-level kernel
features which implement abstraction and isolation of processes. We
look at both concepts in some detail.  One of the earliest container
platforms for Linux is <em> LXC </em> (Linux containers) which is
discussed in a dedicated section.

»)

SECTION(«Virtualization Frameworks»)

The origins of server virtualization date back to the 1960s. The
first virtual machine was created as a collaboration between IBM
(International Business Machines) and the MIT (Massachusetts Institute
of Technology). Since then, many different approaches have been
designed, resulting in several <em> Virtualization Frameworks</em>. All
frameworks promise to improve resource utilization and availability, to
reduce costs, and to provide greater flexibility. While some of these
benefits might be real, they do not come for free. Their costs include:
the host becomes a single point of failure, decreased performance,
added complexity and increased maintenance costs due to extensive
debugging, documentation, and maintenance of the VMs. This chapter
briefly describes the three main virtualization frameworks. We list
the advantages and disadvantages of each and give some examples.

SUBSECTION(«Software Virtualization (Emulation)»)

This virtualization framework does not play a significant role in
server virtualization, it is only included for completeness. Emulation
means to imitate a complete hardware architecture in software,
including peripheral devices. All CPU instructions and hardware
interrupts are interpreted by the emulator rather than being run by
native hardware. Since this approach has a large performance penalty,
it is only suitable when speed is not critical. For this reason,
emulation is typically employed for ancient hardware like arcade
game systems and home computers such as the Commodore 64. Despite
the performance penalty, emulation is valuable because it allows
applications and operating systems to run on the current platform as
they did in their original environment.

Examples: Bochs, Mame, VICE.

SUBSECTION(«Paravirtualization and Hardware-Assisted Virtualization»)

These virtualization frameworks are characterized by the presence
of a <em> hypervisor</em>, also known as <em> Virtual Machine
Monitor</em>, which translates system calls from the VMs to native
hardware requests. In contrast to Software Virtualization, the
host OS does not emulate hardware resources but offers a special
APIs to the VMs. If the presented interface is different to that
of the underlying hardware, the term <em> paravirtualization </em>
is used. The guest OS then has to be modified to include modified
(paravirtualized) drivers. In 2005 AMD and Intel added hardware
virtualization instructions to the CPUs and IOMMUs (Input/Output memory
management units) to the chipsets. This allowed VMs to directly execute
privileged instructions and use peripheral devices. This so-called <em>
Hardware-Assisted Virtualization </em> allows unmodified operating
systems to run on the VMs.

The main advantage of Hardware-Assisted Virtualization is its
flexibility, as the host OS does not need to match the OS running on
the VMs. The disadvantages are hardware compatibility constraints and
performance loss. Although these days all hardware has virtualization
support, there are still significant differences in performance between
the host and the VM. Moreover, peripheral devices like storage hardware
has to be compatible with the chipset to make use of the IOMMU.

Examples: KVM (with QEMU as hypervisor), Xen, UML

SUBSECTION(«OS-level Virtualization (Containers)»)

OS-level Virtualization is a technique for lightweight virtualization.
The abstractions are built directly into the kernel and no
hypervisor is needed. In this context the term "virtual machine" is
inaccurate, which is why the OS-level VMs are called differently in
this context. On Linux, they are called <em> containers</em>, other
operating systems call them <em> jails </em> or <em> zones</em>. We
shall exclusively use "container" from now on. All containers share
a single kernel, so the OS running in the container has to match the
host OS. However, each container has its own root file system, so
containers can differ in user space. For example, different containers
can run different Linux distributions. Since programs running in a
container use the normal system call interface to communicate with
the kernel, OS-level Virtualization does not require hardware support
for efficient performance. In fact, OS-level Virtualization imposes
no overhead at all.

OS-level Virtualization is superior to the alternatives because of its
simplicity and its performance. The only disadvantage is the lack of
flexibility. It is simply not an option if some of the VMs must run
different operating systems than the host.

Examples: LXC, Singularity, Docker.

EXERCISES()

<ul>

	<li> On any Linux system, check if the processor supports virtualization
	by running <code> cat /proc/cpuinfo</code>. Hint: svm and vmx. </li>

	<li> Hypervisors come in two flavors called <em> native </em> and <em>
	hosted</em>. Explain the difference and the pros and cons of either
	flavor. Is QEMU a native or a hosted hypervisor? </li>

	<li> Scan through chapter 15 (Secure Virtual Machine) of the

		<a href="https://www.amd.com/system/files/TechDocs/24593.pdf">AMD Programmer's Manual</a>

	to get an idea of the complexity of Hardware-Assisted
	Virtualization. </li>

</ul>

HOMEWORK(«

<ul>
	<li> Recall the concept of <em> direct memory access </em> (DMA)
	and explain why DMA is a problem for virtualization. Which of the
	three virtualization frameworks of this chapter are affected by this
	problem? </li>

	<li> Compare AMD's <em> Rapid Virtualization Indexing </em> to Intel's
	<em> Extended Page Tables</em>. </li>

	<li> Suppose a hacker gained root access to a VM and wishes to proceed
	from there to get also full control over the host OS. Discuss the thread
	model in the context of the three virtualization frameworks covered
	in this section. </li>

</ul>
»)

SECTION(«Namespaces»)

Namespaces partition the set of processes into disjoint subsets
with local scope. Where the traditional Unix systems provided only
a single system-wide resource shared by all processes, the namespace
abstractions make it possible to give processes the illusion of living
in their own isolated instance.  Linux implements the following
six different types of namespaces: mount (Linux-2.4.x, 2002), IPC
(Linux-2.6.19, 2006), UTS (Linux-2.6.19, 2006), PID (Linux-2.6.24,
2008), network (Linux-2.6.29, 2009), UID (Linux-3.8, 2013).
For OS-level virtualization all six name space types are typically
employed to make the containers look like independent systems.

Before we look at each namespace type, we briefly describe how
namespaces are created and how information related to namespaces can
be obtained for a process.

SUBSECTION(«Namespace API»)

<p> Initially, there is only a single namespace of each type called the
<em> root namespace</em>. All processes belong to this namespace. The
<code> clone(2) </code> system call is a generalization of the classic
<code> fork(2) </code> which allows privileged users to create new
namespaces by passing one or more of the six <code> NEW_ </code>
flags. The child process is made a member of the new namespace. Calling
plain <code> fork(2) </code> or <code> clone(2) </code> with no
<code> NEW_* </code> flag lets the newly created process inherit the
namespaces from its parent. There are two additional system calls,
<code> setns(2) </code> and <code> unshare(2) </code> which both
change the namespace(s) of the calling process without creating a
new process. For the latter, there is a user command, also called
<code> unshare(1) </code> which makes the namespace API available to
scripts. </p>

<p> The <code> /proc/$PID </code> directory of each process contains a
<code> ns </code> subdirectory which contains one file per namespace
type. The inode number of this file is the <em> namespace ID</em>.
Hence, by running <code> stat(1) </code> one can tell whether
two different processes belong to the same namespace. Normally a
namespace ceases to exist when the last process in the namespace
terminates. However, by opening <code> /proc/$PID/ns/$TYPE </code>
one can prevent the namespace from disappearing. </p>

SUBSECTION(«UTS Namespaces»)

UTS is short for <em> UNIX Time-sharing System</em>. The old fashioned
word "Time-sharing" has been replaced by <em> multitasking</em>
but the old name lives on in the <code> uname(2) </code> system
call which fills out the fields of a <code> struct utsname</code>.
On return the <code> nodename </code> field of this structure
contains the hostname which was set by a previous call to <code>
sethostname(2)</code>. Similarly, the <code> domainname </code> field
contains the string that was set with <code> setdomainname(2)</code>.

UTS namespaces provide isolation of these two system identifiers. That
is, processes in different UTS namespaces might see different host- and
domain names. Changing the host- or domainname affects only processes
which belong to the same UTS namespace as the process which called
<code> sethostname(2) </code> or <code> setdomainname(2)</code>.

SUBSECTION(«Mount Namespaces»)

The <em> mount namespaces </em> are the oldest Linux namespace
type. This is kind of natural since they are supposed to overcome
well-known limitations of the venerable <code> chroot(2) </code>
system call which was introduced in 1979. Mount namespaces isolate
the mount points seen by processes so that processes in different
mount namespaces can have different views of the file system hierarchy.

Like for other namespace types, new mount namespaces are created by
calling <code> clone(2) </code> or <code> unshare(2)</code>. The
new mount namespace starts out with a copy of the caller's mount
point list.  However, with more than one mount namespace the <code>
mount(2) </code> and <code> umount(2) </code> system calls no longer
operate on a global set of mount points. Whether or not a mount
or unmount operation has an effect on processes in different mount
namespaces than the caller's is determined by the configurable <em>
mount propagation </em> rules. By default, modifications to the list
of mount points have only affect the processes which are in the same
mount namespace as the process which initiated the modification. This
setting is controlled by the <em> propagation type </em> of the
mount point. Besides the obvious private and shared types, there is
also the <code> MS_SLAVE </code> propagation type which lets mount
and unmount events propagate from from a "master" to its "slaves"
but not the other way round.

SUBSECTION(«Network Namespaces»)

Network namespaces not only partition the set of processes, as all
six namespace types do, but also the set of network interfaces. That
is, each physical or virtual network interface belongs to one (and
only one) network namespace. Initially, all interfaces are in the
root network namespace. This can be changed with the command <code>
ip link set iface netns PID</code>. Processes only see interfaces
whose network namespace matches the one they belong to. This lets
processes in different network namespaces have different ideas about
which network devices exist. Each network namespace has its own IP
stack, IP routing table and TCP and UDP ports. This makes it possible
to start, for example, many <code> sshd(8) </code> processes which
all listen on "their own" TCP port 22.

An OS-level virtualization framework typically leaves physical
interfaces in the root network namespace but creates a dedicated
network namespace and a virtual interface pair for each container. One
end of the pair is left in the root namespace while the other end is
configured to belong to the dedicated namespace, which contains all
processes of the container.

SUBSECTION(«PID Namespaces»)

This namespace type allows a process to have more than one process
ID. Unlike network interfaces which disappear when they enter a
different network namespace, a process is still visible in the root
namespace after it has entered a different PID namespace. Besides its
existing PID it gets a second PID which is only valid inside the target
namespace. Similarly, when a new PID namespace is created by passing
the <code> CLONE_NEWPID </code> flag to <code> clone(2)</code>, the
child process gets some unused PID in the original PID namepspace
but PID 1 in the new namespace.

As as consequence, processes in different PID namespaces can have the
same PID. In particular, there can be arbitrary many "init" processes,
which all have PID 1. The usual rules for PID 1 apply within each PID
namespace. That is, orphaned processes are reparented to the init
process, and it is a fatal error if the init process terminates,
causing all processes in the namespace to terminate as well. PID
namespaces can be nested, but under normal circumstances they are
not. So we won't discuss nesting.

Since each process in a non-root PID namespace has also a PID in the
root PID namespace, processes in the root PID namespace can "see" all
processes but not vice versa. Hence a process in the root namespace can
send signals to all processes while processes in the child namespace
can only send signals to processes in their own namespace.

Processes can be moved from the root PID namespace into a child
PID namespace but not the other way round. Moreover, a process can
instruct the kernel to create subsequent child processes in a different
PID namespace.

SUBSECTION(«User Namespaces»)

User namespaces have been implemented rather late compared to other
namespace types. The implementation was completed in 2013. The purpose
of user namespaces is to isolate user and group IDs. Initially there
is only one user namespace, the <em> initial namespace </em> to which
all processes belong. As with all namespace types, a new user namespace
is created with <code> unshare(2) </code> or <code> clone(2)</code>.

The UID and GID of a process can be different in different
namespaces. In particular, an unprivileged process may have UID
0 inside an user namespace. When a process is created in a new
namespace or an process joins an existing user namespace, it gains full
privileges in this namespace. However, the process has no additional
privileges in the parent/previous namespace. Moreover, a certain flag
is set for the process which prevents the process from entering yet
another namespace with elevated privileges. In particular it does not
keep its privileges when it returns to its original namespace. User
namespaces can be nested, but we don't discuss nesting here.

Each user namespace has an <em> owner</em>, which is the effective user
ID (EUID) of the process which created the namespace. Any process
in the root user namespace whose EUID matches the owner ID has all
capabilities in the child namespace.

If <code> CLONE_NEWUSER </code> is specified together with other
<code> CLONE_NEW* </code> flags in a single <code> clone(2) </code>
or <code> unshare(2) </code> call, the user namespace is guaranteed
to be created first, giving the child/caller privileges over the
remaining namespaces created by the call.

It is possible to map UIDs and GIDs between namespaces.  The <code>
/proc/$PID/uid_map </code> and <code> /proc/$PID/gid_map </code> files
are used to get and set the mappings. We will only talk about UID
mappings in the sequel because the mechanism for the GID mappings are
analogous. When the <code> /proc/$PID/uid_map </code> (pseudo-)file is
read, the contents are computed on the fly and depend on both the user
namespace to which process <code> $PID </code> belongs and the user
namespace of the calling process. Each line contains three numbers
which specify the mapping for a range of UIDs. The numbers have
to be interpreted in one of two ways, depending on whether the two
processes belong to the same user namespace or not. All system calls
which deal with UIDs transparently translate UIDs by consulting these
maps. A map for a newly created namespace is established by writing
UID-triples <em> once </em> to <em> one </em> <code> uid_map </code>
file. Subsequent writes will fail.

SUBSECTION(«IPC Namespaces»)

System V inter process communication (IPC) subsumes three different
mechanisms which enable unrelated processes to communicate with each
other. These mechanisms, known as <em> message queues</em>, <em>
semaphores </em> and <em> shared memory</em>, predate Linux by at
least a decade. They are mandated by the POSIX standard, so every Unix
system has to implement the prescribed API. The common characteristic
of the System V IPC mechanisms is that their objects are addressed
by system-wide IPC <em> identifiers</em> rather than by pathnames.

IPC namespaces isolate these resources so that processes in different
IPC namespaces have different views of the existing IPC identifiers.
When a new IPC namespace is created, it starts out with all three
identifier sets empty. Newly created IPC objects are only visible
for processes which belong to the same IPC namespace as the process
which created the object.

EXERCISES()

<ul>

	<li> Examine <code> /proc/$$/mounts</code>,
	<code>/proc/$$/mountinfo</code>, and <code>/proc/$$/mountstats</code>.
	</li>

	<li> Recall the concept of a <em> bind mount</em>. Describe the
	sequence of mount operations a container implementation would need
	to perform in order to set up a container whose root file system
	is mounted on, say, <code> /mnt </code> before the container is
	started. </li>

	<li> What should happen on the attempt to change a read-only mount
	to be read-write from inside of a container? </li>
	<li> Compile and run <code> <a
	href="#uts_namespace_example">utc-ns.c</a></code>, a  minimal C
	program which illustrates how to create a new UTS namespace. Explain
	each line of the source code. </li>

	<li> Run <code> ls -l /proc/$$/ns </code> to see the namespaces of
	the shell.  Run <code> stat -L /proc/$$/ns/uts </code> and confirm
	that the inode number coincides with the number shown in the target
	of the link of the <code> ls </code> output.

	<li> Discuss why creating a namespace is a privileged operation. </li>

	<li> What is the parent process ID of the init process? Examine the
	fourth field of <code> /proc/1/stat </code> to confirm. </li>

	<li> It is possible for a process in a PID namespace to have a parent
	which is outside of this namespace. This is certainly the case for
	the process with PID 1. Can this also happen for a different process?
	</li>

	<li> Examine the <code> <a
	href="#pid_namespace_example">pid-ns.c</a></code> program. Will the
	two numbers printed as <code> PID </code> and <code> child PID </code>
	be the same? What will be the PPID number? Compile and run the program
	to see if your guess was correct.

	<li> Create a veth socket pair. Check that both ends of the pair are
	visible with <code> ip link show</code>. Start a second shell in a
	different network namespace and confirm by running the same command
	that no network interfaces exist in this namespace. In the original
	namespace, set the namespace of one end of the pair to the process ID
	of the second shell and confirm that the interface "moved" from one
	namespace to the other. Configure (different) IP addresses on both ends
	of the pair and transfer data through the ethernet tunnel between the
	two shell processes which reside in different network namespaces. </li>

	<li> Loopback, bridge, ppp and wireless are <em> network namespace
	local devices</em>, meaning that the namespace of such devices can
	not be changed. Explain why. Run <code> ethtool -k iface </code>
	to find out which devices are network namespace local. </li>

	<li> In a user namespace where the <code> uid_map </code> file has
	not been written, system calls like <code> setuid(2) </code> which
	change process UIDs fail. Why? </li>

	<li> What should happen if a set-user-ID program is executed inside
	of a user namespace and the on-disk UID of the program is not a mapped
	UID? </li>

	<li> Is it possible for a UID to map to different user names even if
	no user namespaces are in use? </li>

</ul>

HOMEWORK(«
The <code> shmctl(2) </code> system call performs operations on a System V
shared memory segment.  It operates on a <code> shmid_ds </code> structure
which contains in the <code> shm_lpid </code> field the PID of the process
which last attached or detached the segment. Describe the implications this API
detail has on the interaction between IPC and PID namespaces.
»)

SECTION(«Control Groups»)

<em> Control groups </em> (cgroups) allow processes to be grouped
and organized hierarchically in a tree. Each control group contains
processes which can be monitored or controlled as a unit, for example
by limiting the resources they can occupy. Several <em> controllers
</em> exist (CPU, memory, I/O, etc.), some of which actually impose
control while others only provide identification and relay control
to separate mechanisms. Unfortunately, control groups are not easy to
understand because the controllers are implemented in an inconsistent
way and because of the rather chaotic relationship between them.

In 2014 it was decided to rework the cgroup subsystem of the Linux
kernel. To keep existing applications working, the original cgroup
implementation, now called <em> cgroup-v1</em>, was retained and a
second, incompatible, cgroup implementation was designed. Cgroup-v2
aims to address the shortcomings of the first version, including its
inefficiency, inconsistency and the lack of interoperability among
controllers. The cgroup-v2 API was made official in 2016. Version 1
continues to work even if both implementations are active.

Both cgroup implementations provide a pseudo file system that
must be mounted in order to define and configure cgroups. The two
pseudo file systems may be mounted at the same time (on different
mountpoints). For both cgroup versions, the standard <code> mkdir(2)
</code> system call creates a new cgroup. To add a process to a cgroup
one must write its PID to one of the files in the pseudo file system.

We will cover both cgroup versions because as of 2018-11 many
applications still rely on cgroup-v1 and cgroup-v2 still lacks some
of the functionality of cgroup-v1. However, we will not look at
all controllers.

SUBSECTION(«CPU controllers»)

These controllers regulate the distribution of CPU cycles. The <em>
cpuset </em> controller of cgroup-v1 is the oldest cgroup controller,
it was implemented before the cgroups-v1 subsystem existed, which is
why it provides its own pseudo file system which is usually mounted at
<code>/dev/cpuset</code>. This file system is only kept for backwards
compability and is otherwise equivalent to the corresponding part of
the cgroup pseudo file system.  The cpuset controller links subsets
of CPUs to cgroups so that the processes in a cgroup are confined to
run only on the CPUs of "their" subset.

The CPU controller of cgroup-v2, which is simply called "cpu", works
differently. Instead of specifying the set of admissible CPUs for a
cgroup, one defines the ratio of CPU cycles for the cgroup.  Work to
support CPU partitioning as the cpuset controller of cgroup-v1 is in
progress and expected to be ready in 2019.

SUBSECTION(«Devices»)

The device controller of cgroup-v1 imposes mandatory access control
for device-special files. It tracks the <code> open(2) </code> and
<code> mknod(2) </code> system calls and enforces the restrictions
defined in the <em> device access whitelist </em> of the cgroup the
calling process belongs to.

Processes in the root cgroup have full permissions. Other cgroups
inherit the device permissions from their parent. A child cgroup
never has more permission than its parent.

Cgroup-v2 takes a completely different approach to device access
control. It is implemented on top of BPF, the <em> Berkeley packet
filter</em>. Hence this controller is not listed in the cgroup-v2
pseudo file system.

SUBSECTION(«Freezer»)

Both cgroup-v1 and cgroup-v2 implement a <em>freezer</em> controller,
which provides an ability to stop ("freeze") all processes in a
cgroup to free up resources for other tasks. The stopped processes can
be continued ("thawed") as a unit later. This is similar to sending
<code>SIGSTOP/SIGCONT</code> to all processes, but avoids some problems
with corner cases. The v2 version was added in 2019-07. It is available
from Linux-5.2 onwards.

SUBSECTION(«Memory»)

Cgroup-v1 offers three controllers related to memory management. First
there is the cpusetcontroller described above which can be instructed
to let processes allocate only memory which is close to the CPUs
of the cpuset. This makes sense on NUMA (non-uniform memory access)
systems where the memory access time for a given CPU depends on the
memory location. Second, the <em> hugetlb </em> controller manages
distribution and usage <em> of huge pages</em>. Third, there is the
<em> memory resource </em> controller which provides a number of
files in the cgroup pseudo file system to limit process memory usage,
swap usage and the usage of memory by the kernel on behalf of the
process. The most important tunable of the memory resource controller
is <code> limit_in_bytes</code>.

The cgroup-v2 version of the memory controller is rather more complex
because it attempts to limit direct and indirect memory usage of
the processes in a cgroup in a bullet-proof way. It is designed to
restrain even malicious processes which try to slow down or crash
the system by indirectly allocating memory. For example, a process
could try to create many threads or file descriptors which all cause a
(small) memory allocation in the kernel. Besides several tunables and
statistics, the memory controller provides the <code> memory.events
</code> file whose contents change whenever a state transition
for the cgroup occurs, for example when processes are started to get
throttled because the high memory boundary was exceeded. This file
could be monitored by a <em> management agent </em> to take appropriate
actions. The main mechanism to control the memory usage is the <code>
memory.high </code> file.

SUBSECTION(«I/O»)

I/O controllers regulate the distribution of IO resources among
cgroups. The throttling policy of cgroup-v2 can be used to enforce I/O
rate limits on arbitrary block devices, for example on a logical volume
provided by the logical volume manager (LVM). Read and write bandwidth
may be throttled independently. Moreover, the number of IOPS (I/O
operations per second) may also be throttled.  The I/O controller of
cgroup-v1 is called <em> blkio </em> while for cgroup-v2 it is simply
called <em> io</em>.  The features of the v1 and v2 I/O controllers
are identical but the filenames of the pseudo files and the syntax
for setting I/O limits differ. The exercises ask the reader to try
out both versions.

There is no cgroup-v2 controller for multi-queue schedulers so far.
However, there is the <em> I/O Latency </em> controller for cgroup-v2
which works for arbitrary block devices and all I/O schedulers. It
features <em> I/O workload protection </em> for the processes in
a cgroup. This works by throttling the processes in cgroups that
have a lower latency target than those in the protected cgroup. The
throttling is performed by lowering the depth of the request queue
of the affected devices.

EXERCISES()

<ul>
	<li> Run <code> mount -t cgroup none /var/cgroup </code> and <code>
	mount -t cgroup2 none /var/cgroup2 </code> to mount both cgroup pseudo
	file systems and explore the files they provide. </li>

	<li> Learn how to put the current shell into a new cgroup.
	Hints: For v1, start with <code> echo 0 > cpuset.mems && echo 0 >
	cpuset.cpus</code>. For v2: First activate controllers for the cgroup
	in the parent directory. </li>

	<li> Set up the cpuset controller so that your shell process has only
	access to a single CPU core. Test that the limitation is enforced by
	running <code>stress -c 2</code>. </li>

	<li> Repeat the above for the cgroup-v2 CPU controller. Hint: <code>
	echo 1000000 1000000 > cpu.max</code>. </li>

	<li> In a cgroup with one bash process, start a simple loop that prints
	some output: <code> while :; do date; sleep 1; done</code>. Freeze
	and unfreeze the cgroup by writing the string <code> FROZEN </code>
	to a suitable <code> freezer.state </code> file in the cgroup-v1 file
	system. Then unfreeze the cgroup by writing <code> THAWED </code>
	to the same file. Find out how one can tell whether a given cgroup
	is frozen. </li>

	<li> Pick a block device to throttle. Estimate its maximal read
	bandwidth by running a command like <code> ddrescue /dev/sdX
	/dev/null</code>.  Enforce a read bandwidth rate of 1M/s for the
	device by writing a string of the form <code> "$MAJOR:$MINOR $((1024 *
	1024))" </code> to a file named <code> blkio.throttle.read_bps_device
	</code> in the cgroup-v1 pseudo file system. Check that the bandwidth
	was indeed throttled by running the above <code> ddrescue </code>
	command again. </li>

	<li> Repeat the previous exercise, but this time use the cgroup-v2
	interface for the I/O controller. Hint: write a string of the form
	<code> $MAJOR:MINOR rbps=$((1024 * 1024))" </code> to a file named
	<code>io.max</code>. </li>

</ul>

HOMEWORK(«
<ul>

	<li> In one terminal running <code> bash</code>, start a second <code>
	bash </code> process and print its PID with <code> echo $$</code>.
	Guess what happens if you run <code> kill -STOP $PID; kill -CONT
	$PID</code> from a second terminal, where <code> $PID </code>
	is the PID that was printed in the first terminal. Try it out,
	explain the observed behaviour and discuss its impact on the freezer
	controller. Repeat the experiment but this time use the freezer
	controller to stop and restart the bash process. </li>
</ul>

»)

SECTION(«Linux Containers (LXC)»)

Containers provide resource management through control groups and
resource isolation through namespaces. A <em> container platform </em>
is thus a software layer implemented on top of these features. Given a
directory containing a Linux root file system, starting the container
is a simple matter: First <code> clone(2) </code> is called with the
proper <code> NEW_* </code> flags to create a new process in a suitable
set of namespaces. The child process then creates a cgroup for the
container and puts itself into it. The final step is to let the child
process hand over control to the container's <code> /sbin/init </code>
by calling <code> exec(2)</code>. When the last process in the newly
created namespaces exits, the namespaces disappear and the parent
process removes the cgroup. The details are a bit more complicated,
but the above covers the essence of what the container startup command
has to do.

Many container platforms offer additional features not to be discussed
here, like downloading and unpacking a file system image from the
internet, or supplying the root file system for the container by other
means, for example by creating an LVM snapshot of a master image.
LXC is a comparably simple container platform which can be used to
start a single daemon in a container, or to boot a container from
a root file system as described above. It provides several <code>
lxc-* </code> commands to start, stop and maintain containers.
LXC version 1 is much simpler than subsequent versions, and is still
being maintained, so we only discuss this version of LXC here.

An LXC container is defined by a configuration file in
the format described in <code> lxc.conf(5)</code>. A <a
href="#minimal_lxc_config_file"> minimal configuration </a> which
defines a network device and requests CPU and memory isolation has
as few as 10 lines (not counting comments). With the configuration
file and the root file system in place, the container can be started
by running <code> lxc-start -n $NAME</code>. One can log in to the
container on the local pseudo terminal or via ssh (provided the sshd
package is installed). The container can be stopped by executing
<code> halt </code> from within the container, or by running <code>
lxc-stop </code> on the host system. <code> lxc-ls </code> and
<code> lxc-info</code> print information about containers, and <code>
lxc-cgroup </code> changes the settings of the cgroup associated with
a container.

The exercises ask the reader to install the LXC package from source,
and to set up a minimal container running Ubuntu-18.04.

EXERCISES()

<ul>

	<li> Clone the LXC git repository from <code>
	https://github.com/lxc/lxc</code>, check out the <code> stable-1.0
	</code> tag. Compile the source code with <code> ./autogen.sh </code>
	and <code> ./configure && make</code>. Install with <code> sudo make
	install</code>. </li>

	<li> Download a minimal Ubuntu root file system with a command like
	<code> debootstrap --download-only --include isc-dhcp-client bionic
	/media/lxc/buru/ http://de.archive.ubuntu.com/ubuntu</code>. </li>

	<li> Set up an ethernet bridge as described in the <a
	href="./Networking.html#link_layer">Link Layer</a> section of the
	chapter on networking. </li>

	<li> Examine the <a href="#minimal_lxc_config_file"> minimal
	configuration file </a> for the container and copy it to <code>
	/var/lib/lxc/buru/config</code>. Adjust host name, MAC address and
	the name of the bridge interface. </li>

	<li> Start the container with <code> lxc-start -n buru</code>. </li>

	<li> While the container is running, investigate the control files of the
	cgroup pseudo file system. Identify the pseudo files which describe the
	CPU and memory limit. </li>

	<li> Come up with a suitable <code> lxc-cgroup </code> command
	to change the cpuset and the memory of the container while it is
	running. </li>

	<li> On the host system, create a loop device and a file system on
	it. Mount the file system on a subdirectory of the root file system
	of the container.  Note that the mount is not visible from within the
	container. Come up with a way to make it visible without restarting
	the container. </li>

</ul>

HOMEWORK(«Compare the features of LXC versions 1, 2 and 3.»)

SUPPLEMENTS()

SUBSECTION(«UTS Namespace Example»)
<pre>
	<code>
		#define _GNU_SOURCE
		#include &lt;sys/utsname.h&gt;
		#include &lt;sched.h&gt;
		#include &lt;stdio.h&gt;
		#include &lt;stdlib.h&gt;
		#include &lt;unistd.h&gt;

		static void print_hostname_and_exit(const char *pfx)
		{
			struct utsname uts;

			uname(&uts);
			printf("%s: %s\n", pfx, uts.nodename);
			exit(EXIT_SUCCESS);
		}

		static int child(void *arg)
		{
			sethostname("jesus", 5);
			print_hostname_and_exit("child");
		}

		#define STACK_SIZE (64 * 1024)
		static char child_stack[STACK_SIZE];

		int main(int argc, char *argv[])
		{
			clone(child, child_stack + STACK_SIZE, CLONE_NEWUTS, NULL);
			print_hostname_and_exit("parent");
		}
	</code>
</pre>

SUBSECTION(«PID Namespace Example»)
<pre>
	<code>
		#define _GNU_SOURCE
		#include &lt;sched.h&gt;
		#include &lt;unistd.h&gt;
		#include &lt;stdlib.h&gt;
		#include &lt;stdio.h&gt;

		static int child(void *arg)
		{
			printf("PID: %d, PPID: %d\n", (int)getpid(), (int)getppid());
		}

		#define STACK_SIZE (64 * 1024)
		static char child_stack[STACK_SIZE];

		int main(int argc, char *argv[])
		{
			pid_t pid = clone(child, child_stack + STACK_SIZE, CLONE_NEWPID, NULL);
			printf("child PID: %d\n", (int)pid);
			exit(EXIT_SUCCESS);
		}
	</code>
</pre>

SUBSECTION(«Minimal LXC Config File»)
<pre>
	<code>
		# Employ cgroups to limit the CPUs and the amount of memory the container is
		# allowed to use.
		lxc.cgroup.cpuset.cpus = 0-1
		lxc.cgroup.memory.limit_in_bytes = 2G

		# So that the container starts out with a fresh UTS namespace that
		# has already set its hostname.
		lxc.utsname = buru

		# LXC does not play ball if we don't set the type of the network device.
		# It will always be veth.
		lxc.network.type = veth

		# This sets the name of the veth pair which is visible on the host. This
		# way it is easy to tell which interface belongs to which container.
		lxc.network.veth.pair = buru

		# Of course we need to tell LXC where the root file system of the container
		# is located. LXC will automatically mount a couple of pseudo file systems
		# for the container, including /proc and /sys.
		lxc.rootfs = /media/lxc/buru

		# so that we can assign a fixed address via DHCP
		lxc.network.hwaddr = ac:de:48:32:35:cf

		# You must NOT have a link from /dev/kmsg pointing to /dev/console. In the host
		# it should be a real device. In a container it must NOT exist. When /dev/kmsg
		# points to /dev/console, systemd-journald reads from /dev/kmsg and then writes
		# to /dev/console (which it then reads from /dev/kmsg and writes again to
		# /dev/console ad infinitum). You've inadvertently created a messaging loop
		# that's causing systemd-journald to go berserk on your CPU.
		#
		# Make sure to remove /var/lib/lxc/${container}/rootfs.dev/kmsg
		lxc.kmsg = 0

		lxc.network.link = br39

		# This is needed for lxc-console
		lxc.tty = 4
	</code>
</pre>

SECTION(«Further Reading»)
<ul>
	<li> <a href="https://lwn.net/Articles/782876/">The creation of the
	io.latency block I/O controller</a>, by Josef Bacik: </li>
</ul>