TITLE(« Who the heck is General Failure, and why is he reading my disk? -- Unknown », __file__) OVERVIEW(« The idea of Logical Volume Management is to decouple data and storage. This offers great flexibility in managing storage and reduces server downtimes because the storage may be replaced while file systems are mounted read-write and applications are actively using them. This chapter provides an introduction to the Linux block layer and LVM. Subsequent sections cover selected device mapper targets. ») SECTION(«The Linux Block Layer»)

The main task of LVM is the management of block devices, so it is natural to start an introduction to LVM with a section on the Linux block layer, which is the central component in the Linux kernel for the handling of persistent storage devices. The mission of the block layer is to provide a uniform interface to different types of storage devices. The obvious in-kernel users of this interface are the file systems and the swap subsystem. Stacking device drivers like LVM, Bcache and MD also access block devices through this interface to create virtual block devices from other block devices. Some user space programs (fdisk, dd, mkfs, ...) also need to access block devices. The block layer allows them to perform their tasks in a well-defined and uniform manner through block-special device files.
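
To make this uniform interface tangible, the sketch below opens a block-special device file and asks the block layer for the size of the underlying device by means of the BLKGETSIZE64 ioctl, which is roughly what "blockdev --getsize64" does. The terse error handling and the usage message are only illustrative; running it requires permission to open the device.

	/*
	 * Minimal sketch: print the size of a block device in bytes by
	 * querying the block layer through the device's block-special file.
	 * Example: ./a.out /dev/sda (needs suitable permissions).
	 */
	#include <fcntl.h>
	#include <linux/fs.h>	/* BLKGETSIZE64 */
	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		uint64_t size; /* device size in bytes */
		int fd;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
			exit(EXIT_FAILURE);
		}
		fd = open(argv[1], O_RDONLY);
		if (fd < 0) {
			perror("open");
			exit(EXIT_FAILURE);
		}
		if (ioctl(fd, BLKGETSIZE64, &size) < 0) {
			perror("BLKGETSIZE64");
			exit(EXIT_FAILURE);
		}
		printf("%llu\n", (unsigned long long)size);
		close(fd);
		return 0;
	}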

The user space programs and the in-kernel users interact with the block layer by sending read or write requests. A bio is the central data structure that carries such requests within the kernel. Bios may contain an arbitrary amount of data. They are given to the block layer to be queued for subsequent handling. Often a bio has to travel through a stack of block device drivers where each driver modifies the bio and sends it on to the next driver. Typically, only the last driver in the stack corresponds to a hardware device.

Besides requests to read or write data blocks, there are various other bio requests that carry SCSI commands like FLUSH, FUA (Force Unit Access), TRIM and UNMAP. FLUSH and FUA ensure that certain data hits stable storage. FLUSH asks the device to write out the contents of its volatile write cache while a FUA request carries data that should be written directly to the device, bypassing all caches. UNMAP (SCSI) and TRIM (ATA) are commands which are mainly relevant to SSDs. They constitute a promise by the OS not to read the given range of blocks any more, so the device is free to discard the contents and return arbitrary data on the next read. This helps the device to level out the number of times the flash storage cells are overwritten (wear-leveling), which improves the durability of the device.
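
User space can trigger such a discard request for a range of a block device with the BLKDISCARD ioctl; this is essentially what the blkdiscard(8) utility does. The sketch below discards the first megabyte of the given device. Since the discarded contents are lost, it should only ever be pointed at a scratch device.

	/*
	 * Sketch: discard (TRIM/UNMAP) the first megabyte of a block device
	 * via the BLKDISCARD ioctl. This destroys data, so use a scratch
	 * device only.
	 */
	#include <fcntl.h>
	#include <linux/fs.h>	/* BLKDISCARD */
	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		uint64_t range[2] = {0, 1024 * 1024}; /* offset and length in bytes */
		int fd;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <scratch-device>\n", argv[0]);
			exit(EXIT_FAILURE);
		}
		fd = open(argv[1], O_WRONLY);
		if (fd < 0) {
			perror("open");
			exit(EXIT_FAILURE);
		}
		if (ioctl(fd, BLKDISCARD, &range) < 0)
			perror("BLKDISCARD"); /* e.g. EOPNOTSUPP if the device can not discard */
		close(fd);
		return 0;
	}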

The first task of the block layer is to split incoming bios if necessary to make them conform to the size limit or the alignment requirements of the target device, and to batch and merge bios so that they can be submitted as a unit for performance reasons. The bios processed in this way then form an I/O request which is handed to an I/O scheduler (also known as elevator).

At the time of writing (2018-11) there exist two different sets of schedulers: the traditional single-queue schedulers and the modern multi-queue schedulers, which are expected to replace the single-queue schedulers soon. The three single-queue schedulers, noop, deadline and cfq (completely fair queueing), were designed for rotating disks. They reorder requests with the aim of minimizing seek time. The newer multi-queue schedulers, mq-deadline, kyber, and bfq (budget fair queueing), aim to max out even the fastest devices. As implied by the name "multi-queue", they implement several request queues, the number of which depends on the hardware in use. This has become necessary because modern storage hardware allows multiple requests to be submitted in parallel from different CPUs. Moreover, with many CPUs the locking overhead required to put a request into a queue increases. Per-CPU queues allow for per-CPU locks, which decreases queue lock contention.
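
The scheduler which is active for a given block device can be inspected and changed through sysfs. The sketch below merely prints the scheduler attribute of the given disk; writing one of the listed names back to the same file (as root) switches the elevator at runtime. The sysfs path is the one used by current kernels, while the disk name argument (e.g. sda) is of course system dependent.

	/*
	 * Sketch: show the I/O schedulers offered for a disk. The active
	 * scheduler is printed in square brackets. Example: ./a.out sda
	 */
	#include <stdio.h>
	#include <stdlib.h>

	int main(int argc, char **argv)
	{
		char path[256], line[256];
		FILE *f;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <disk>\n", argv[0]);
			exit(EXIT_FAILURE);
		}
		snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", argv[1]);
		f = fopen(path, "r");
		if (!f) {
			perror(path);
			exit(EXIT_FAILURE);
		}
		if (fgets(line, sizeof(line), f))
			fputs(line, stdout); /* e.g. "[mq-deadline] kyber bfq none" */
		fclose(f);
		return 0;
	}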

We will take a look at some aspects of the Linux block layer and at the various I/O schedulers. An exercise on loop devices enables the reader to create block devices for testing. This will come in handy in the subsequent sections on LVM-specific topics.
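
For readers who would like to see what losetup(8) does behind the scenes, here is a sketch which binds a regular file to a free loop device through the loop control interface. The backing file is assumed to exist already (it can be created with dd or truncate), and the program has to run as root.

	/*
	 * Sketch of what losetup(8) does: attach a regular file to a free
	 * loop device so it can be used as a block device. Needs root.
	 */
	#include <fcntl.h>
	#include <linux/loop.h>	/* LOOP_CTL_GET_FREE, LOOP_SET_FD */
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		int ctl_fd, back_fd, loop_fd, devnr;
		char path[32];

		if (argc != 2) {
			fprintf(stderr, "usage: %s <backing-file>\n", argv[0]);
			exit(EXIT_FAILURE);
		}
		ctl_fd = open("/dev/loop-control", O_RDWR);
		if (ctl_fd < 0) {
			perror("open /dev/loop-control");
			exit(EXIT_FAILURE);
		}
		devnr = ioctl(ctl_fd, LOOP_CTL_GET_FREE); /* number of a free loop device */
		if (devnr < 0) {
			perror("LOOP_CTL_GET_FREE");
			exit(EXIT_FAILURE);
		}
		snprintf(path, sizeof(path), "/dev/loop%d", devnr);
		back_fd = open(argv[1], O_RDWR);
		loop_fd = open(path, O_RDWR);
		if (back_fd < 0 || loop_fd < 0) {
			perror("open");
			exit(EXIT_FAILURE);
		}
		if (ioctl(loop_fd, LOOP_SET_FD, back_fd) < 0) { /* bind file to device */
			perror("LOOP_SET_FD");
			exit(EXIT_FAILURE);
		}
		printf("%s\n", path); /* the new block device */
		return 0;
	}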

EXERCISES() HOMEWORK(« ») define(«svg_disk», « ») SECTION(«Physical and Logical Volumes, Volume Groups»)

Getting started with the Logical Volume Manager (LVM) requires getting used to a minimal set of vocabulary. This section introduces the terms named in the title of the section, and a couple more. The basic concepts of LVM are then described in terms of these words.

define(«lvm_width», «300») define(«lvm_height», «183») define(«lvm_margin», «10») define(«lvm_extent_size», «10») define(«lvm_extent», « ») dnl $1: color, $2: x, $3: y, $4: number of extents define(«lvm_extents», « ifelse(«$4», «0», «», « lvm_extent(«$1», «$2», «$3») lvm_extents(«$1», eval($2 + lvm_extent_size() + lvm_margin()), «$3», eval($4 - 1)) ») ») dnl $1: x, $2: y, $3: number of extents, $4: disk color, $5: extent color define(«lvm_disk», « ifelse(eval(«$3» > 3), «1», « pushdef(«h», «eval(7 * lvm_extent_size())») pushdef(«w», «eval(($3 + 1) * lvm_extent_size())») », « pushdef(«h», «eval(3 * lvm_extent_size() + lvm_margin())») pushdef(«w», «eval($3 * lvm_extent_size() * 2)») ») svg_disk(«$1», «$2», «w()», «h()», «$4») ifelse(eval(«$3» > 3), «1», « pushdef(«n1», eval(«$3» / 2)) pushdef(«n2», eval(«$3» - n1())) lvm_extents(«$5», eval(«$1» + (w() - (2 * n1() - 1) * lvm_extent_size()) / 2), eval(«$2» + h() / 2 - lvm_extent_size()), «n1()») lvm_extents(«$5», eval(«$1» + (w() - (2 * n2() - 1) * lvm_extent_size()) / 2), eval(«$2» + h() / 2 + 2 * lvm_extent_size()), «n2()») popdef(«n1») popdef(«n2») », « lvm_extents(«$5», eval(«$1» + (w() - (2 * «$3» - 1) * lvm_extent_size()) / 2), eval(«$2» + h() / 2), «$3») ») popdef(«w») popdef(«h») ») lvm_disk(«20», «20», «2», «#666», «yellow») lvm_disk(«10», «90», «4», «#666», «yellow») lvm_disk(«70», «55», «5», «#666», «yellow») lvm_disk(«190», «22», «7», «#66f», «orange») lvm_disk(«220», «130», «1», «#66f», «orange»)

A Physical Volume (PV, grey) is an arbitrary block device which contains a certain metadata header (also known as superblock) at the start. PVs can be partitions on a local hard disk or an SSD, a software or hardware RAID, or a loop device. LVM does not care. The storage space on a physical volume is managed in units called Physical Extents (PEs, yellow). The default PE size is 4M.
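
The metadata header can be inspected directly. The sketch below scans the first four 512-byte sectors of a device for the LVM label magic "LABELONE", which pvcreate(8) normally writes to the second sector. The on-disk format is simplified here, so treat the program as an illustration rather than as a reliable detection tool.

	/*
	 * Sketch: check whether a device looks like an LVM physical volume
	 * by scanning its first four sectors for the "LABELONE" label magic.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		unsigned char sector[512];
		int fd, i;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <device>\n", argv[0]);
			exit(EXIT_FAILURE);
		}
		fd = open(argv[1], O_RDONLY);
		if (fd < 0) {
			perror("open");
			exit(EXIT_FAILURE);
		}
		for (i = 0; i < 4; i++) {
			if (pread(fd, sector, sizeof(sector), (off_t)i * 512)
					!= (ssize_t)sizeof(sector))
				break;
			if (!memcmp(sector, "LABELONE", 8)) {
				printf("PV label found in sector %d\n", i);
				close(fd);
				return 0;
			}
		}
		printf("no PV label found\n");
		close(fd);
		return 1;
	}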

A Volume Group (VG, green) is a non-empty set of PVs with a name and a unique ID assigned to it. A PV can, but does not need to, be assigned to a VG. If it is, the ID of the associated VG is stored in the metadata header of the PV.

A Logical Volume (LV, blue) is a named block device which is provided by LVM. LVs are always associated with a VG and are stored on that VG's PVs. Since LVs are normal block devices, file systems of any type can be created on them, they can be used as swap storage, etc. The chunks of a LV are managed as Logical Extents (LEs, orange). Often the LE size equals the PE size. For each LV there is a mapping between the LEs of the LV and the PEs of the underlying PVs. The PEs may be spread across multiple PVs.

VGs can be extended by adding additional PVs to them, or reduced by removing unused devices, i.e., those with no PEs allocated on them. PEs may be moved from one PV to another while the LVs are active. LVs may be grown or shrunk. To grow a LV, there must be enough space left in the VG. Growing a LV does not magically grow the file system stored on it, however. To make use of the additional space, a second, file system-specific step is needed to tell the file system that its underlying block device (the LV) has grown.

The exercises of this section illustrate the basic LVM concepts and the essential LVM commands. They ask the reader to create a VG whose PVs are loop devices. This VG is used as a starting point in subsequent chapters.

EXERCISES() HOMEWORK(« In the above scenario (two LVs in a VG consisting of two PVs), how can you tell whether both PVs are actually used? Remove the LVs with lvremove. Recreate them, but this time use the --stripes 2 option to lvcreate. Explain what this option does and confirm with a suitable command. ») SECTION(«Device Mapper and Device Mapper Targets»)

The kernel part of the Logical Volume Manager (LVM) is called device mapper (DM), which is a generic framework to map one block device to another. Applications talk to the Device Mapper via the libdevmapper library, which issues requests to the /dev/mapper/control character device using the ioctl(2) system call. The device mapper is also accessible from scripts via the dmsetup(8) tool.
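
As a small illustration of the library interface, the following sketch lists the names of all mapped devices, which is roughly what "dmsetup ls" prints. It has to be linked against libdevmapper (-ldevmapper) and run as root so that it may open /dev/mapper/control.

	/*
	 * Minimal sketch: list the names of all device mapper devices via
	 * libdevmapper. Link with -ldevmapper; needs root to open
	 * /dev/mapper/control.
	 */
	#include <libdevmapper.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		struct dm_task *dmt;
		struct dm_names *names;
		unsigned next = 0;

		dmt = dm_task_create(DM_DEVICE_LIST);
		if (!dmt)
			exit(EXIT_FAILURE);
		if (!dm_task_run(dmt)) { /* issues the ioctl to the control device */
			dm_task_destroy(dmt);
			exit(EXIT_FAILURE);
		}
		names = dm_task_get_names(dmt);
		if (names->dev) { /* dev == 0 means there are no mapped devices */
			do {
				names = (struct dm_names *)((char *)names + next);
				printf("%s\n", names->name);
				next = names->next; /* offset to the next entry, 0 at the end */
			} while (next);
		}
		dm_task_destroy(dmt);
		return 0;
	}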

A DM target represents one particular mapping type for ranges of LEs. Several DM targets exist, each of which creates and maintains block devices with certain characteristics. In this section we take a look at the dmsetup tool and the relatively simple mirror target. Subsequent sections cover other targets in more detail.

EXERCISES() HOMEWORK(« As advertised in the introduction, LVM allows the administrator to replace the underlying storage of a file system online. This is done by running a suitable pvmove(8) command to move all PEs of one PV to different PVs in the same VG. ») SECTION(«LVM Snapshots»)

LVM snapshots are based on the CoW optimization strategy described earlier in the chapter on Unix Concepts. Creating a snapshot means creating a CoW table of the given size. Just before a LE of a snapshotted LV is written to, its contents are copied to a free slot in the CoW table. This preserves an old version of the LV, the snapshot, which can later be reconstructed by overlaying the CoW table atop the LV.

Snapshots can be taken from a LV which contains a mounted file system, while applications are actively modifying files. Without coordination between the file system and LVM, the file system most likely has memory buffers scheduled for writeback. These outstanding writes did not make it to the snapshot, so one cannot expect the snapshot to contain a consistent file system image. Instead, it is in a state similar to that of a regular device after an unclean shutdown. This is not a problem for XFS and EXT4, as both are journalling file systems, which were designed with crash recovery in mind. At the next mount after a crash, journalling file systems replay their journal, which results in a consistent state. Note that this implies that even a read-only mount of the snapshot device has to write to the device.
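
A fully consistent snapshot can be obtained by quiescing the file system before the snapshot is taken. This is what the fsfreeze(8) utility does by means of the FIFREEZE and FITHAW ioctls, as in the sketch below, which freezes the file system mounted at the given directory and thaws it again right away; in practice the snapshot command would be run in between. Use it on a test file system only, since a frozen file system blocks all writers.

	/*
	 * Sketch of what fsfreeze(8) does: quiesce a mounted file system so
	 * that a snapshot taken while it is frozen contains a clean image,
	 * then thaw it again. Pass the mount point as the only argument.
	 */
	#include <fcntl.h>
	#include <linux/fs.h>	/* FIFREEZE, FITHAW */
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		int fd;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <mount-point>\n", argv[0]);
			exit(EXIT_FAILURE);
		}
		fd = open(argv[1], O_RDONLY);
		if (fd < 0) {
			perror("open");
			exit(EXIT_FAILURE);
		}
		if (ioctl(fd, FIFREEZE, 0) < 0) {
			perror("FIFREEZE");
			exit(EXIT_FAILURE);
		}
		/*
		 * The file system is now frozen: dirty data has been written back
		 * and new writes block. Take the snapshot here, then thaw.
		 */
		if (ioctl(fd, FITHAW, 0) < 0) {
			perror("FITHAW");
			exit(EXIT_FAILURE);
		}
		close(fd);
		return 0;
	}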

EXERCISES() SECTION(«Thin Provisioning»)

The term "thin provisioning" is just a modern buzzword for over-subscription. Both terms mean to give the appearance of having more resources than are actually available. This is achieved by on-demand allocation. The thin provisioning implementation of Linux is implemented as a DM target called dm-thin. This code first made its appearance in 2011 and was declared as stable two years later. These days it should be safe for production use.

The general problem with thin provisioning is of course that bad things happen when the resources are exhausted because the demand has increased before new resources were added. For dm-thin this can happen when users write to their allotted space, causing dm-thin to attempt allocating a data block from a volume which is already full. This usually leads to severe data corruption because file systems are not really prepared to handle this error case and treat it as if the underlying block device had failed. dm-thin does nothing to prevent this, but one can configure a low watermark. When the number of free data blocks drops below the watermark, a so-called dm-event will be generated to notify the administrator.

One highlight of dm-thin is its efficient support for an arbitrary depth of recursive snapshots, called dm-thin snapshots in this document. With the traditional snapshot implementation, recursive snapshots quickly become a performance issue as the depth increases. With dm-thin one can have an arbitrary subset of all snapshots active at any point in time, and there is no ordering requirement on activating or removing them.

The block devices created by dm-thin always belong to a thin pool which ties together two LVs called the metadata LV and the data LV. The combined LV is called the thin pool LV. Setting up a VG for thin provisioning is done in two steps: First the standard LVs for data and the metadata are created. Second, the two LVs are combined into a thin pool LV. The second step hides the two underlying LVs so that only the combined thin pool LV is visible afterwards. Thin provisioned LVs and dm-thin snapshots can then be created from the thin pool LV with a single command.

Another nice feature of dm-thin is external snapshots. An external snapshot is one where the origin for a thinly provisioned device is not a device of the pool. Arbitrary read-only block devices can be turned into writable devices by creating an external snapshot. Reads to an unprovisioned area of the snapshot will be passed through to the origin. Writes trigger the allocation of new blocks as usual with CoW. One use case for this is VM hosts which run their VMs on thinly-provisioned volumes but have the base image on some "master" device which is read-only and can hence be shared between all VMs.

EXERCISES()

Starting with the tvg VG, create and test a thin pool LV by performing the following steps. The "Thin Usage" section of lvmthin(7) will be helpful.

HOMEWORK(« When a thin pool provisions a new data block for a thin LV, the new block is first overwritten with zeros by default. Discuss why this is done, its impact on performance and security, and conclude whether or not it is a good idea to turn off the zeroing. ») SECTION(«Bcache, dm-cache and dm-writecache»)

All three implementations named in the title of this section are Linux block layer caches. They combine two different block devices to form a hybrid block device which dynamically caches and migrates data between the two devices with the aim of improving performance. One device, the backing device, is expected to be large and slow while the other one, the cache device, is expected to be small and fast.

define(«bch_width», «300») define(«bch_height», «130») define(«bch_margin», «10») define(«bch_rraid_width», «eval((bch_width() - 4 * bch_margin()) * 4 / 5)») define(«bch_raidbox_height», «eval(bch_height() - 2 * bch_margin())») define(«bch_nraid_width», «eval(bch_rraid_width() / 4)») define(«bch_rdisk_width», «eval((bch_width() - 3 * bch_margin()) * 18 / 100)») define(«bch_rdisk_height», «eval((bch_height() - 4 * bch_margin()) / 3)») define(«bch_ndisk_width», «eval(bch_rdisk_width() / 2)») define(«bch_ndisk_height», «eval(bch_raidbox_height() - 5 * bch_margin())») define(«bch_rdisk», «svg_disk(«$1», «$2», «bch_rdisk_width()», «bch_rdisk_height()», «#666»)») define(«bch_ndisk», «svg_disk(«$1», «$2», «bch_ndisk_width()», «bch_ndisk_height()», «#66f»)») define(«bch_5rdisk», « bch_rdisk(«$1», «$2») bch_rdisk(«eval($1 + bch_margin())», «eval($2 + bch_margin())») bch_rdisk(«eval($1 + 2 * bch_margin())», «eval($2 + 2 * bch_margin())») bch_rdisk(«eval($1 + 3 * bch_margin())», «eval($2 + 3 * bch_margin())») bch_rdisk(«eval($1 + 4 * bch_margin())», «eval($2 + 4 * bch_margin())») ») define(«bch_rraid», « bch_5rdisk(«eval($1 + bch_margin())», «eval($2 + 2 * bch_margin())») bch_5rdisk(«eval($1 + 2 * bch_rdisk_width() + bch_margin())», «eval($2 + 2 * bch_margin())») ») define(«bch_nraid», « bch_ndisk(eval($1 + bch_margin()), eval($2 + 2 * bch_margin())) bch_ndisk(eval($1 + 2 * bch_margin()), eval($2 + 3 * bch_margin())) ») bch_nraid(«bch_margin()», «bch_margin()») bch_rraid(«eval(2 * bch_margin() + bch_nraid_width())», «bch_margin()»)

The simplest setup consists of a single rotating disk and one SSD. The setup shown in the diagram at the left is realistic for a large server with redundant storage. In this setup the hybrid device (yellow) combines a raid6 array (green) consisting of many rotating disks (grey) with a two-disk raid1 array (orange) stored on fast NVMe devices (blue). In the simple setup it is always a win when I/O is performed from/to the SSD instead of the rotating disk. In the server setup, however, which device is faster depends on the workload. Given enough rotating disks and a streaming I/O workload, the raid6 outperforms the raid1 because all disks can read or write at full speed.

Since block layer caches hook into the Linux block API described earlier, the hybrid block devices they provide can be used like any other block device. In particular, the hybrid devices are file system agnostic, meaning that any file system can be created on them. In what follows we briefly describe the differences between the three block layer caches and conclude with the pros and cons of each.

Bcache is a stand-alone stacking device driver which was included in the Linux kernel in 2013. According to the bcache home page, it is "done and stable". dm-cache and dm-writecache are device mapper targets included in 2013 and 2018, respectively, which are both marked as experimental. In contrast to dm-cache, dm-writecache only caches writes while reads are supposed to be cached in RAM. It has been designed for programs like databases which need low commit latency. Both bcache and dm-cache can operate in writeback or writethrough mode while dm-writecache always operates in writeback mode.

The DM-based caches are designed to leave the decision as to what data to migrate (and when) to user space while bcache has this policy built-in. However, at this point only the Stochastic Multiqueue (smq) policy for dm-cache exists, plus a second policy which is only useful for decommissioning the cache device. There are no tunables for dm-cache while all the bells and whistles of bcache can be configured through sysfs files. Another difference is that bcache detects sequential I/O and separates it from random I/O so that large streaming reads and writes bypass the cache and don't push cached randomly accessed data out of the cache.
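
As an example of this sysfs interface, the sketch below prints two bcache settings: the cache mode and the sequential cutoff above which streaming I/O bypasses the cache. The device name bcache0 and the attribute paths are assumptions based on the bcache documentation and may need to be adjusted to the local setup; writing to the same files changes the settings.

	/*
	 * Sketch: print two bcache tunables of the hybrid device bcache0 by
	 * reading its sysfs attributes. Device and attribute names are
	 * assumptions based on the bcache documentation; adjust as needed.
	 */
	#include <stdio.h>

	static void print_attr(const char *attr)
	{
		char path[256], line[256];
		FILE *f;

		snprintf(path, sizeof(path), "/sys/block/bcache0/bcache/%s", attr);
		f = fopen(path, "r");
		if (!f) {
			perror(path);
			return;
		}
		if (fgets(line, sizeof(line), f))
			printf("%s: %s", attr, line);
		fclose(f);
	}

	int main(void)
	{
		print_attr("cache_mode");        /* e.g. "writethrough [writeback] ..." */
		print_attr("sequential_cutoff"); /* larger sequential I/O bypasses the cache */
		return 0;
	}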

bcache is the clear winner of this comparison: it is stable, configurable, and performs better, at least on the server setup described above, because it separates random and sequential I/O. The only advantage of dm-cache is the flexibility gained by switchable cache policies. But even this remains a theoretical advantage as long as only a single policy for dm-cache exists.

EXERCISES() HOMEWORK(« Explain why small writes to a file system which is stored on a parity raid result in read-modify-write (RMW) updates. Explain why RMW updates are particularly expensive and how raid implementations and block layer caches try to avoid them. ») HOMEWORK(« Recall the concepts of writeback and writethrough. Describe what each mode means for a hardware device and for a bcache/dm-cache device. Explain why writeback is faster and writethrough is safer. ») HOMEWORK(« TRIM and UNMAP are special commands in the ATA/SCSI command sets which inform an SSD that certain data blocks are no longer in use, allowing the SSD to re-use these blocks to increase performance and to reduce wear. Subsequent reads from the trimmed data blocks will not return any meaningful data. For example, the mkfs commands send this command to discard all blocks of the device. Discuss the implications when mkfs. is run on a device provided by bcache or dm-cache. ») SECTION(«The dm-crypt Target»)

This device mapper target provides encryption of arbitrary block devices by employing the primitives of the crypto API of the Linux kernel. This API provides a uniform interface to a large number of cipher algorithms which have been implemented with performance and security in mind.

The cipher algorithm of choice for the encryption of block devices is the Advanced Encryption Standard (AES), also known as Rijndael, named after the two Belgian cryptographers Rijmen and Daemen, who proposed the algorithm in 1999. AES is a symmetric block cipher. That is, a transformation which operates on fixed-length blocks and which is determined by a single key for both encryption and decryption. The underlying algorithm is fairly simple, which makes AES perform well in both hardware and software. The key setup time and memory requirements are also excellent. Modern processors of all manufacturers include instructions to perform AES operations in hardware, improving speed and security.
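
The symmetric nature of AES is easy to demonstrate with the OpenSSL EVP interface, against which the random stream generator in the supplement also links (-lcrypto). The toy program below encrypts one block of text with AES-256 in CBC mode using an all-zero key and IV as placeholders. It merely illustrates the cipher; it says nothing about how dm-crypt manages keys and per-sector IVs, and dm-crypt typically runs AES in XTS mode instead.

	/*
	 * Toy example: encrypt one block of text with AES-256-CBC via the
	 * OpenSSL EVP interface and print the ciphertext in hex. The all-zero
	 * key and IV are placeholders, not a secure choice. Link with -lcrypto.
	 */
	#include <openssl/evp.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		unsigned char key[32] = {0}, iv[16] = {0}; /* placeholders only */
		unsigned char in[16] = "attack at dawn!", out[48];
		int len1 = 0, len2 = 0, i;
		EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

		if (!ctx)
			exit(EXIT_FAILURE);
		if (EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv) != 1
				|| EVP_EncryptUpdate(ctx, out, &len1, in, sizeof(in)) != 1
				|| EVP_EncryptFinal_ex(ctx, out + len1, &len2) != 1) {
			fprintf(stderr, "encryption failed\n");
			exit(EXIT_FAILURE);
		}
		for (i = 0; i < len1 + len2; i++)
			printf("%02x", out[i]);
		printf("\n");
		EVP_CIPHER_CTX_free(ctx);
		return 0;
	}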

According to the Snowden documents, the NSA has been doing research on breaking AES for a long time without being able to come up with a practical attack for 256 bit keys. Successful attacks invariably target the key management software instead, which is often implemented poorly, trading security for user-friendliness, for example by storing passwords weakly encrypted, or by providing a "feature" which can decrypt the device without knowing the password.

The exercises of this section ask the reader to encrypt a loop device with AES without relying on any third-party key management software.

EXERCISES() HOMEWORK(« Why is it a good idea to overwrite a block device with random data before it is encrypted? ») HOMEWORK(« The dm-crypt target encrypts whole block devices. An alternative is to encrypt on the file system level. That is, each file is encrypted separately. Discuss the pros and cons of both approaches. ») SUPPLEMENTS() SUBSECTION(«Random stream»)
	
		/*
		 * Write an endless stream of cryptographically strong pseudo-random
		 * bytes to stdout. Link with -lcrypto.
		 */
		#include <openssl/rand.h>
		#include <stdio.h>
		#include <stdlib.h>
		#include <unistd.h>

		int main(int argc, char **argv)
		{
			unsigned char buf[1024 * 1024];

			for (;;) {
				size_t n;
				ssize_t written;
				int ret = RAND_bytes(buf, sizeof(buf));

				if (ret <= 0) {
					fprintf(stderr, "RAND_bytes() error\n");
					exit(EXIT_FAILURE);
				}
				/* write(2) may write less than requested */
				for (n = 0; n < sizeof(buf); n += written) {
					written = write(STDOUT_FILENO, buf + n,
						sizeof(buf) - n);
					if (written < 0) {
						perror("write");
						exit(EXIT_FAILURE);
					}
				}
			}
			return 0;
		}