Who the heck is General Failure, and why is he reading my disk? -- Unknown
The idea of Logical Volume Management is to decouple data and
storage. This offers great flexibility in managing storage and reduces
server downtimes because the storage may be replaced while file
systems are mounted read-write and applications are actively using
them. This chapter provides an introduction to the Linux block layer
and LVM. Subsequent sections cover selected device mapper targets.
SECTION(«The Linux Block Layer»)
<p> The main task of LVM is the management of block devices, so it is
natural to start an introduction to LVM with a section on the Linux
block layer, which is the central component in the Linux kernel
for the handling of persistent storage devices. The mission of the
block layer is to provide a uniform interface to different types
of storage devices. The obvious in-kernel users of this interface
are the file systems and the swap subsystem. But <em> stacking
device drivers </em> like LVM, Bcache and MD also access block devices
through this interface to create virtual block devices from other block
devices. Some user space programs (<code>fdisk, dd, mkfs, ...</code>)
also need to access block devices. The block layer allows them to
perform their task in a well-defined and uniform manner through
block-special device files. </p>
<p> The userspace programs and the in-kernel users interact with the block
layer by sending read or write requests. A <em>bio</em> is the central
data structure that carries such requests within the kernel. Bios
may contain an arbitrary amount of data. They are given to the block
layer to be queued for subsequent handling. Often a bio has to travel
through a stack of block device drivers where each driver modifies
the bio and sends it on to the next driver. Typically, only the last
driver in the stack corresponds to a hardware device. </p>
<p> Besides requests to read or write data blocks, there are various other
bio requests that carry SCSI commands like FLUSH, FUA (Force Unit
Access), TRIM and UNMAP. FLUSH and FUA ensure that certain data hits
stable storage. FLUSH asks the device to write out the contents of
its volatile write cache while a FUA request carries data that should
be written directly to the device, bypassing all caches. UNMAP/TRIM is
a SCSI/ATA command which is only relevant to SSDs. It is a promise of
the OS to not read the given range of blocks any more, so the device
is free to discard the contents and return arbitrary data on the
next read. This helps the device to level out the number of times
the flash storage cells are overwritten (<em>wear-leveling</em>),
which improves the durability of the device. </p>
<p> The first task of the block layer is to split incoming bios if
necessary to make them conform to the size limit or the alignment
requirements of the target device, and to batch and merge bios so that
they can be submitted as a unit for performance reasons. The bios
processed in this way then form an I/O request which is handed to an <em>
I/O scheduler </em> (also known as <em> elevator</em>). </p>
<p> Traditionally, the schedulers were designed for rotating disks.
They implemented a single request queue and reordered the queued
I/O requests with the aim to minimize disk seek times. The newer
multi-queue schedulers mq-deadline, kyber, and bfq (budget fair
queueing) aim to max out even the fastest devices. As implied by
the name "multi-queue", they implement several request queues,
the number of which depends on the hardware in use. This has become
necessary because modern storage hardware allows multiple requests
to be submitted in parallel from different CPUs. Moreover, with many
CPUs the locking overhead required to put a request into a queue
increases. Per-CPU queues allow for per-CPU locks, which decreases
queue lock contention. </p>
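<p> The active scheduler of a block device can be inspected and changed
at runtime through sysfs. The sketch below wraps the two operations in
shell functions; the device name <code>sda</code> used in the comments
is only an example, and writing to the sysfs file requires root. </p>

```shell
# Show the I/O schedulers supported for the given device; the active
# one is printed in brackets, e.g. "[mq-deadline] kyber bfq none".
show_scheduler() {
	cat "/sys/block/$1/queue/scheduler"
}

# Activate the given scheduler (requires root), e.g.
# "set_scheduler sda bfq". The change takes effect immediately,
# even while the device is in use.
set_scheduler() {
	echo "$2" > "/sys/block/$1/queue/scheduler"
}
```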
<p> We will take a look at some aspects of the Linux block layer and
at the various I/O schedulers. An exercise on loop devices enables the
reader to create block devices for testing. This will be handy in
the subsequent sections on LVM specific topics. </p>
<li> Run <code>find /dev -type b</code> to get the list of all block
devices on your system. Explain which is which. </li>

<li> Examine the files in <code>/sys/block/sda</code>, in
particular <code>/sys/block/sda/stat</code>. Search the web for
<code>Documentation/block/stat.txt</code> for the meaning of the
numbers shown. Then run <code>iostat -xdh sda 1</code>. </li>
<li> Examine the files in <code>/sys/block/sda/queue</code>. </li>

<li> Find out how to determine the size of a block device. </li>

<li> Figure out a way to identify the name of all block devices which
correspond to SSDs (i.e., excluding any rotating disks). </li>

<li> Run <code>lsblk</code> and discuss
the output. Too easy? Run <code>lsblk -o
KNAME,PHY-SEC,MIN-IO,OPT-IO,LOG-SEC,RQ-SIZE,ROTA,SCHED</code>
<li> What's the difference between a task scheduler and an I/O
scheduler? </li>

<li> Why are I/O schedulers also called elevators? </li>
<li> How can one find out which I/O schedulers are supported on a
system and which scheduler is active for a given block device? </li>

<li> Is it possible (and safe) to change the I/O scheduler for a
block device while it is in use? If so, how can this be done? </li>
<li> The loop device driver of the Linux kernel allows privileged
users to create a block device from a regular file stored on a file
system. The resulting block device is called a <em>loop</em> device.
Create a 1G large temporary file containing only zeroes. Run a suitable
<code>losetup(8)</code> command to create a loop device from the
file. Create an XFS file system on the loop device and mount it. </li>
<li> Come up with three different use cases for loop devices. </li>

<li> Given a block device node in <code> /dev</code>, how can one
tell that it is a loop device? </li>

<li> Describe the connection between loop devices created by
<code>losetup(8)</code> and the loopback device used for network
connections from the machine to itself. </li>
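<p> The loop device exercise above can be sketched as follows. The
backing file and mount point names are made up for the example, and
everything except the <code>truncate</code> command needs root. </p>

```shell
# Create a sparse 1G file; disk blocks are allocated only when
# they are actually written to.
truncate -s 1G /tmp/loopfile.img

# Attach the file to the first free loop device, create an XFS
# file system on it and mount it (all of this requires root).
setup_loop_fs() {
	dev=$(losetup --find --show /tmp/loopfile.img) &&
	mkfs.xfs "$dev" &&
	mkdir -p /mnt/loop &&
	mount "$dev" /mnt/loop
}
```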
SECTION(«Physical and Logical Volumes, Volume Groups»)
<p> Getting started with the Logical Volume Manager (LVM) requires
getting used to a minimal set of vocabulary. This section introduces
the words named in the title of the section, and a couple more.
The basic concepts of LVM are then described in terms of these words. </p>
dnl [m4/SVG macros which render the diagram for this section: grey
dnl disks whose yellow squares represent physical extents (PVs in
dnl a VG), and blue disks whose orange squares represent logical
dnl extents (LVs).]
<p> A <em> Physical Volume</em> (PV, grey) is an arbitrary block device which
contains a certain metadata header (also known as <em>superblock</em>)
at the start. PVs can be partitions on a local hard disk or an SSD,
a software or hardware raid, or a loop device. LVM does not care.
The storage space on a physical volume is managed in units called <em>
Physical Extents </em> (PEs, yellow). The default PE size is 4M. </p>
<p> A <em>Volume Group</em> (VG, green) is a non-empty set of PVs with
a name and a unique ID assigned to it. A PV can, but does not need
to, be assigned to a VG. If it is, the ID of the associated VG is stored
in the metadata header of the PV. </p>
<p> A <em> Logical Volume</em> (LV, blue) is a named block device which is
provided by LVM. LVs are always associated with a VG and are stored
on that VG's PVs. Since LVs are normal block devices, file systems
of any type can be created on them, they can be used as swap storage,
etc. The chunks of an LV are managed as <em>Logical Extents</em> (LEs,
orange). Often the LE size equals the PE size. For each LV there is
a mapping between the LEs of the LV and the PEs of the underlying
PVs. These PEs may be spread across multiple PVs. </p>
<p> VGs can be extended by adding additional PVs to them, or reduced by
removing unused devices, i.e., those with no PEs allocated on them. PEs
may be moved from one PV to another while the LVs are active. LVs
may be grown or shrunk. To grow an LV, there must be enough space
left in the VG. Growing an LV does not magically grow the file system
stored on it, however. To make use of the additional space, a second,
file system specific step is needed to tell the file system that its
underlying block device (the LV) has grown. </p>
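<p> For an XFS file system, the two-step grow operation might look as
in the sketch below. The LV and mount point names are placeholders
for the example, and both commands require root. </p>

```shell
# Step 1: grow the LV by 1G (the VG must have 1G of free PEs).
# Step 2: tell the file system that its block device has grown.
# Both steps are safe while the file system is mounted.
grow_lv_and_fs() {
	lvextend -L +1G /dev/tvg/tlv1 &&
	xfs_growfs /mnt/1	# for ext4, use: resize2fs /dev/tvg/tlv1
}
```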
<p> The exercises of this section illustrate the basic LVM concepts
and the essential LVM commands. They ask the reader to create a VG
whose PVs are loop devices. This VG is used as a starting point in
subsequent chapters. </p>
<li> Create two 5G large loop devices <code>/dev/loop1</code>
and <code>/dev/loop2</code>. Make them PVs by running
<code>pvcreate</code>. Create a VG <code>tvg</code> (test volume group)
from the two loop devices, and create two 3G large LVs named <code>tlv1</code>
and <code>tlv2</code> on it. Run the <code>pvcreate, vgcreate</code>,
and <code>lvcreate</code> commands with <code>-v</code> to activate
verbose output and try to understand each output line. </li>
<li> Run <code>pvs, vgs, lvs, lvdisplay, pvdisplay</code> and examine
the output. </li>

<li> Run <code>lvdisplay -m</code> to examine the mapping of logical
extents to PVs and physical extents. </li>

<li> Run <code>pvs --segments -o+lv_name,seg_start_pe,segtype</code>
to see the map between physical extents and logical extents. </li>
In the above scenario (two LVs in a VG consisting of two PVs), how
can you tell whether both PVs are actually used? Remove the LVs
with <code>lvremove</code>. Recreate them, but this time use the
<code>--stripes 2</code> option to <code>lvcreate</code>. Explain
what this option does and confirm with a suitable command.
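One possible solution sketch for the first exercise. The backing
file names are made up, the loop devices are assumed to be free,
and all commands require root:

```shell
# Create two loop devices backed by sparse 5G files, turn them into
# PVs, build the tvg VG from them, and create two 3G LVs on it.
create_test_vg() {
	truncate -s 5G /tmp/pv1.img /tmp/pv2.img &&
	losetup /dev/loop1 /tmp/pv1.img &&
	losetup /dev/loop2 /tmp/pv2.img &&
	pvcreate -v /dev/loop1 /dev/loop2 &&
	vgcreate -v tvg /dev/loop1 /dev/loop2 &&
	lvcreate -v -L 3G -n tlv1 tvg &&
	lvcreate -v -L 3G -n tlv2 tvg
}
```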
SECTION(«Device Mapper and Device Mapper Targets»)
<p> The kernel part of the Logical Volume Manager (LVM) is called
<em>device mapper</em> (DM), which is a generic framework to map
one block device to another. Applications talk to the Device Mapper
via the <em>libdevmapper</em> library, which issues requests
to the <code>/dev/mapper/control</code> character device using the
<code>ioctl(2)</code> system call. The device mapper is also accessible
from scripts via the <code>dmsetup(8)</code> tool. </p>
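<p> As a small taste of the <code>dmsetup</code> interface, the sketch
below concatenates two devices with the <em>linear</em> target. Each
table line has the form <code>start_sector num_sectors target_type
target_args</code>; the device paths and sizes are examples, and
the command requires root. </p>

```shell
# Concatenate the first 1G (2097152 sectors of 512 bytes) of two
# devices into a single 2G device named "joined", which then
# appears as /dev/mapper/joined.
make_linear_dev() {
	dmsetup create joined <<- EOF
		0 2097152 linear /dev/loop1 0
		2097152 2097152 linear /dev/loop2 0
	EOF
}
```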
<p> A DM target represents one particular mapping type for ranges
of LEs. Several DM targets exist, each of which creates and
maintains block devices with certain characteristics. In this section
we take a look at the <code>dmsetup</code> tool and the relatively
simple <em>mirror</em> target. Subsequent sections cover other targets
<li> Run <code>dmsetup targets</code> to list all targets supported
by the currently running kernel. Explain their purpose and typical
use cases. </li>
<li> Starting with the <code>tvg</code> VG, remove <code>tlv2</code>.
Convince yourself by running <code>vgs</code> that <code>tvg</code>
is 10G large, with 3G being in use. Run <code>pvmove
/dev/loop1</code> to move the used PEs of <code>/dev/loop1</code>
to <code>/dev/loop2</code>. After the command completes, run
<code>pvs</code> again to see that <code>/dev/loop1</code> has no
more PEs in use. </li>
<li> Create a third 5G loop device <code>/dev/loop3</code>, make it a
PV and extend the VG with <code>vgextend tvg /dev/loop3</code>. Remove
<code>tlv1</code>. Now the LEs of <code>tlv2</code> fit on any
of the three PVs. Come up with a command which moves them to
<code>/dev/loop3</code>. </li>

<li> The first two loop devices are both unused. Remove them from
the VG with <code>vgreduce -a</code>. Why are they still listed in
the <code>pvs</code> output? What can be done about that? </li>
As advertised in the introduction, LVM allows the administrator to
replace the underlying storage of a file system online. This is done
by running a suitable <code>pvmove(8)</code> command to move all PEs of
one PV to different PVs in the same VG.
<li> Explain the mapping type of dm-mirror. </li>

<li> The traditional way to mirror the contents of two or more block
devices is software raid 1, also known as <em>md raid1</em> ("md"
is short for multi-disk). Explain the difference between md raid1,
the dm-raid target which supports raid1 and other raid levels, and
the dm-mirror target. </li>

<li> Guess how <code>pvmove</code> is implemented on top of
dm-mirror. Verify your guess by reading the "NOTES" section of the
<code>pvmove(8)</code> man page. </li>
SECTION(«LVM Snapshots»)
<p> LVM snapshots are based on the CoW optimization strategy described
earlier in the chapter on <a href="./Unix_Concepts.html#processes">Unix
Concepts</a>. Creating a snapshot means creating a CoW table of the
given size. Just before an LE of a snapshotted LV is about to be written
to, its contents are copied to a free slot in the CoW table. This
preserves an old version of the LV, the snapshot, which can later be
reconstructed by overlaying the CoW table atop the LV. </p>
<p> Snapshots can be taken from an LV which contains a mounted file system,
while applications are actively modifying files. Without coordination
between the file system and LVM, the file system most likely has memory
buffers scheduled for writeback. These outstanding writes did not make
it to the snapshot, so one can not expect the snapshot to contain a
consistent file system image. Instead, it is in a similar state to a
regular device after an unclean shutdown. This is not a problem for
XFS and EXT4, as both are <em>journalling</em> file systems, which
were designed with crash recovery in mind. At the next mount after a
crash, journalling file systems replay their journal, which results
in a consistent state. Note that this implies that even a read-only
mount of the snapshot device has to write to the device. </p>
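<p> A minimal sketch of taking and mounting a snapshot, using the
names from the exercises of this section; both commands require
root. For XFS, <code>-o nouuid</code> is needed because the snapshot
carries the same file system UUID as the origin. </p>

```shell
# Take a snapshot of tlv1 with a 1G CoW table, then mount it.
# Note that even a read-only mount would replay the XFS journal,
# i.e., write to the snapshot device.
snapshot_and_mount() {
	lvcreate -s -L 1G -n snap_tlv1 tvg/tlv1 &&
	mkdir -p /mnt/snap &&
	mount -o nouuid /dev/tvg/snap_tlv1 /mnt/snap
}
```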
<li> In the test VG, create a 1G large snapshot named
<code>snap_tlv1</code> of the <code>tlv1</code> LV by using the
<code>-s</code> option to <code>lvcreate(8)</code>. Predict how much
free space is left in the VG. Confirm with <code>vgs tvg</code>. </li>

<li> Create an EXT4 file system on <code>tlv1</code> by running
<code>mkfs.ext4 /dev/tvg/tlv1</code>. Guess how much of the snapshot
space has been allocated by this operation. Check with <code>lvs
tvg/snap_tlv1</code>. </li>
<li> Remove the snapshot with <code>lvremove</code> and recreate
it. Repeat the previous step, but this time run <code>mkfs.xfs</code>
to create an XFS file system. Run <code>lvs tvg/snap_tlv1</code>
again and compare the used snapshot space to the EXT4 case. Explain
the difference. </li>

<li> Remove the snapshot and recreate it so that both <code>tlv1</code>
and <code>snap_tlv1</code> contain a valid XFS file system. Mount
the file systems on <code>/mnt/1</code> and <code>/mnt/2</code>. </li>
<li> Run <code>dd if=/dev/zero of=/mnt/1/zero count=$((2 * 100 *
1024))</code> to create a 100M large file on <code>tlv1</code>. Check
that <code>/mnt/2</code> is still empty. Estimate how much of the
snapshot space is used and check again. </li>

<li> Repeat the above <code>dd</code> command 5 times and run
<code>lvs</code> again. Explain why the used snapshot space did not
increase. </li>
<li> It is possible to create snapshots of snapshots. This is
implemented by chaining together CoW tables. Describe the impact on
performance. </li>
<li> Suppose a snapshot was created before significant modifications
were made to the contents of the LV, for example an upgrade of a large
software package. Assume that the user wishes to permanently return to
the old version because the upgrade did not work out. In this scenario
it is the snapshot which needs to be retained, rather than the original
LV. In view of this scenario, guess what happens on the attempt to
remove an LV which is being snapshotted. Unmount <code>/mnt/1</code>
and confirm by running <code>lvremove tvg/tlv1</code>. </li>
<li> Come up with a suitable <code>lvconvert</code> command which
replaces the role of the LV and its snapshot. Explain why this solves
the "bad upgrade" problem outlined above. </li>

<li> Explain what happens if the CoW table fills up. Confirm by
writing a file larger than the snapshot size. </li>
SECTION(«Thin Provisioning»)
<p> The term "thin provisioning" is just a modern buzzword for
over-subscription. Both terms mean to give the appearance of having
more resources than are actually available. This is achieved by
on-demand allocation. On Linux, thin provisioning is implemented
as a DM target called <em>dm-thin</em>. This code
first made its appearance in 2011 and was declared stable two
years later. These days it should be safe for production use. </p>
<p> The general problem with thin provisioning is of course that bad
things happen when the resources are exhausted because the demand has
increased before new resources were added. For dm-thin this can happen
when users write to their allotted space, causing dm-thin to attempt
allocating a data block from a volume which is already full. This
usually leads to severe data corruption because file systems are
not really prepared to handle this error case and treat it as if the
underlying block device had failed. dm-thin does nothing to prevent
this, but one can configure a <em>low watermark</em>. When the
number of free data blocks drops below the watermark, a so-called
<em>dm-event</em> will be generated to notify the administrator. </p>
<p> One highlight of dm-thin is its efficient support for an arbitrary
depth of recursive snapshots, called <em>dm-thin snapshots</em>
in this document. With the traditional snapshot implementation,
recursive snapshots quickly become a performance issue as the depth
increases. With dm-thin one can have an arbitrary subset of all
snapshots active at any point in time, and there is no ordering
requirement on activating or removing them. </p>
<p> The block devices created by dm-thin always belong to a <em>thin
pool</em> which ties together two LVs called the <em>metadata LV</em>
and the <em>data LV</em>. The combined LV is called the <em>thin pool
LV</em>. Setting up a VG for thin provisioning is done in two steps:
First the standard LVs for data and the metadata are created. Second,
the two LVs are combined into a thin pool LV. The second step hides
the two underlying LVs so that only the combined thin pool LV is
visible afterwards. Thin provisioned LVs and dm-thin snapshots can
then be created from the thin pool LV with a single command. </p>
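<p> The two-step setup just described can be sketched as follows,
using the sizes and names from the exercises of this section. The
commands require root; <code>lvmthin(7)</code> documents them
authoritatively. </p>

```shell
# Create the data and metadata LVs, combine them into a thin pool
# (this hides tdlv and tmdlv), and carve a 10G thin LV out of the
# 5G pool; space is only allocated on demand.
create_thin_pool() {
	lvcreate -L 5G -n tdlv tvg &&
	lvcreate -L 500M -n tmdlv tvg &&
	lvconvert --type thin-pool --poolmetadata tvg/tmdlv tvg/tdlv &&
	lvcreate -V 10G -n oslv --thinpool tvg/tdlv
}
```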
<p> Another nice feature of dm-thin is <em>external snapshots</em>.
An external snapshot is one where the origin for a thinly provisioned
device is not a device of the pool. Arbitrary read-only block
devices can be turned into writable devices by creating an external
snapshot. Reads to an unprovisioned area of the snapshot will be passed
through to the origin. Writes trigger the allocation of new blocks as
usual with CoW. One use case for this is VM hosts which run their VMs
on thinly-provisioned volumes but have the base image on some "master"
device which is read-only and can hence be shared between all VMs. </p>
<p> Starting with the <code>tvg</code> VG, create and test a thin pool LV
by performing the following steps. The "Thin Usage" section of
<code>lvmthin(7)</code> will be helpful. </p>
<li> Remove the <code>tlv1</code> and <code>tlv2</code> LVs. </li>

<li> Create a 5G data LV named <code>tdlv</code> (thin data LV)
and a 500M LV named <code>tmdlv</code> (thin metadata LV). </li>
<li> Combine the two LVs into a thin pool with
<code>lvconvert</code>. Run <code>lvs -a</code> and explain the flags
listed below <code>Attr</code>. </li>
<li> Create a 10G thin LV named <code>oslv</code> (over-subscribed
LV). </li>

<li> Create an XFS file system on <code>oslv</code> and mount it on
<code>/mnt</code>. </li>
<li> Run a loop of the form <code>for ((i = 0; i < 50; i++)); do
... ; done</code> so that each iteration creates a 50M file named
<code>file-$i</code> and a snapshot named <code>snap_oslv-$i</code>
of <code>oslv</code>. </li>
<li> Activate an arbitrary snapshot with <code>lvchange -K</code> and
try to mount it. Explain what the error message means. Then read the
"XFS on snapshots" section of <code>lvmthin(7)</code>. </li>

<li> Check the available space of the data LV with <code>lvs
-a</code>. Mount one snapshot (specifying <code>-o nouuid</code>)
and run <code>lvs -a</code> again. Why did the free space decrease
although no new files were written? </li>

<li> Mount four different snapshots and check that they contain the
expected files. </li>
<li> Remove all snapshots. Guess what <code>lvs -a</code> and <code>df
-h /mnt</code> report. Then run the commands to confirm. Guess
what happens if you try to create another 3G file? Confirm
your guess, then read the section on "Data space exhaustion" of
<code>lvmthin(7)</code>. </li>
When a thin pool provisions a new data block for a thin LV, the new
block is first overwritten with zeros by default. Discuss why this
is done, its impact on performance and security, and conclude whether
or not it is a good idea to turn off the zeroing.
SECTION(«Bcache, dm-cache and dm-writecache»)
<p> All three implementations named in the title of this chapter are <em>
Linux block layer caches</em>. They combine two different block
devices to form a hybrid block device which dynamically caches
and migrates data between the two devices with the aim to improve
performance. One device, the <em> backing device</em>, is expected
to be large and slow while the other one, the <em>cache device</em>,
is expected to be small and fast. </p>
dnl [m4/SVG macros which render the diagram for this section: a
dnl hybrid device (yellow) combining a raid6 array of many rotating
dnl disks (grey, green box) with a raid1 array of two fast NVMe
dnl devices (blue, orange box).]
<p> The simplest setup consists of a single rotating disk and one SSD.
The setup shown in the diagram at the left is realistic for a large
server with redundant storage. In this setup the hybrid device
(yellow) combines a raid6 array (green) consisting of many rotating
disks (grey) with a two-disk raid1 array (orange) stored on fast
NVMe devices (blue). In the simple setup it is always a win when
I/O is performed from/to the SSD instead of the rotating disk. In
the server setup, however, it depends on the workload which device
is faster. Given enough rotating disks and a streaming I/O workload,
the raid6 outperforms the raid1 because all disks can read or write
in parallel. </p>
<p> Since block layer caches hook into the Linux block API described <a
href="«#»the_linux_block_layer">earlier</a>, the hybrid block devices
they provide can be used like any other block device. In particular,
the hybrid devices are <em> file system agnostic</em>, meaning that
any file system can be created on them. In what follows we briefly
describe the differences between the three block layer caches and
conclude with the pros and cons of each. </p>
<p> Bcache is a stand-alone stacking device driver which was
included in the Linux kernel in 2013. According to the <a
href="https://bcache.evilpiepirate.org/">bcache home page</a>, it
is "done and stable". dm-cache and dm-writecache are device mapper
targets included in 2013 and 2018, respectively, which are both marked
as experimental. In contrast to dm-cache, dm-writecache only caches
writes while reads are supposed to be cached in RAM. It has been
designed for programs like databases which need low commit latency.
Both bcache and dm-cache can operate in writeback or writethrough
mode while dm-writecache always operates in writeback mode. </p>
<p> The DM-based caches are designed to leave the decision as to what
data to migrate (and when) to user space while bcache has this policy
built-in. However, at this point only the <em> Stochastic Multiqueue
</em> (smq) policy for dm-cache exists, plus a second policy which
is only useful for decommissioning the cache device. There are no
tunables for dm-cache while all the bells and whistles of bcache can
be configured through sysfs files. Another difference is that bcache
detects sequential I/O and separates it from random I/O so that large
streaming reads and writes bypass the cache and don't push cached
randomly accessed data out of the cache. </p>
<p> bcache is the clear winner of this comparison because it is stable,
configurable, and performs better, at least on the server setup
described above, because it separates random and sequential I/O. The
only advantage of dm-cache is its flexibility because cache policies
can be switched. But even this remains a theoretical advantage as
long as only a single policy for dm-cache exists. </p>
<li> Recall the concepts of writeback and writethrough and explain
why writeback is faster and writethrough is safer. </li>

<li> Explain how the <em>writearound</em> mode of bcache works and
when it should be used. </li>
<li> Setup a bcache device from two loop devices. </li>

<li> Create a file system on a bcache device and mount it. Detach
the cache device while the file system is mounted. </li>

<li> Setup a dm-cache device from two loop devices. </li>

<li> Setup a thin pool where the data LV is a dm-cache device.</li>

<li> Explain the point of dm-cache's <em>passthrough</em> mode.</li>
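A bcache setup from two loop devices might be sketched like this.
The device names are examples, the commands require root and the
bcache-tools package, and the resulting hybrid device shows up as
<code>/dev/bcache0</code>:

```shell
setup_bcache() {
	make-bcache -B /dev/loop1 &&	# backing device: large, slow
	make-bcache -C /dev/loop2 &&	# cache device: small, fast
	# Attach the cache set to the backing device by writing the
	# cache set UUID to the sysfs attach file.
	bcache-super-show /dev/loop2 | awk '/cset.uuid/ { print $2 }' |
		tee /sys/block/bcache0/bcache/attach > /dev/null
}
```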
Explain why small writes to a file system which is stored on a
parity raid result in read-modify-write (RMW) updates. Explain why
RMW updates are particularly expensive and how raid implementations
and block layer caches try to avoid them.

Recall the concepts of writeback and writethrough. Describe what
each mode means for a hardware device and for a bcache/dm-cache
device. Explain why writeback is faster and writethrough is safer.
TRIM and UNMAP are special commands in the ATA/SCSI command sets
which inform an SSD that certain data blocks are no longer in use,
allowing the SSD to re-use these blocks to increase performance and
to reduce wear. Subsequent reads from the trimmed data blocks will
not return any meaningful data. For example, the <code> mkfs </code>
command sends this command to discard all blocks of the device.
Discuss the implications when <code> mkfs </code> is run on a device
provided by bcache or dm-cache.
SECTION(«The dm-crypt Target»)
<p> This device mapper target provides encryption of arbitrary block
devices by employing the primitives of the crypto API of the Linux
kernel. This API provides a uniform interface to a large number of
cipher algorithms which have been implemented with performance and
security in mind. </p>
<p> The cipher algorithm of choice for the encryption of block devices
is the <em> Advanced Encryption Standard </em> (AES), also known
as <em> Rijndael</em>, named after the two Belgian cryptographers
Rijmen and Daemen who proposed the algorithm in 1999. AES is a <em>
symmetric block cipher</em>. That is, a transformation which operates
on fixed-length blocks and which is determined by a single key for both
encryption and decryption. The underlying algorithm is fairly simple,
which makes AES perform well in both hardware and software. Also
the key setup time and the memory requirements are excellent. Modern
processors of all manufacturers include instructions to perform AES
operations in hardware, improving speed and security. </p>
<p> According to the Snowden documents, the NSA has been doing research
on breaking AES for a long time without being able to come up with
a practical attack for 256 bit keys. Successful attacks invariably
target the key management software instead, which is often implemented
poorly, trading security for user-friendliness, for example by
storing passwords weakly encrypted, or by providing a "feature"
which can decrypt the device without knowing the password. </p>
<p> The exercises of this section ask the reader to encrypt a loop device
with AES without relying on any third party key management software. </p>
<li> Discuss the message of this <a
href="https://xkcd.com/538/">xkcd</a> comic. </li>
<li> How can a hardware implementation of an algorithm like AES
improve security? After all, it is the same algorithm that is
implemented in software. </li>
<li> What's the point of the <a href="#random_stream">rstream.c</a>
program below which writes random data to stdout? Doesn't <code>
cat /dev/urandom </code> do the same? </li>
<li> Compile and run <a href="#random_stream">rstream.c</a> to create
a 10G local file and create the loop device <code> /dev/loop0 </code>
from it. </li>
<li> A <em> table </em> for the <code> dmsetup(8) </code> command is
a single line of the form <code> start_sector num_sectors target_type
target_args</code>. Determine the correct values for the first three
arguments to encrypt <code> /dev/loop0</code>. </li>
<li> The <code>target_args</code> for the dm-crypt target are
of the form <code>cipher key iv_offset device offset</code>. To
encrypt <code>/dev/loop0</code> with AES-256, <code>cipher</code>
is <code>aes</code>, <code>device</code> is <code>/dev/loop0</code>
and both offsets are zero. Come up with an idea to create a 256 bit
key from a passphrase. </li>
<li> The <code> create </code> subcommand of <code> dmsetup(8)
</code> creates a device from the given table. Run a command of
the form <code> echo "$table" | dmsetup create cryptdev </code>
to create the encrypted device <code> /dev/mapper/cryptdev </code>
from the loop device. </li>
<li> Create a file system on <code> /dev/mapper/cryptdev</code>,
mount it and create the file <code> passphrase </code> containing
the string "super-secret" on this file system. </li>
<li> Unmount the <code> cryptdev </code> device and run <code> dmsetup
remove cryptdev</code>. Run <code> strings </code> on the loop device
and on the underlying file to see if it contains the string <code>
super-secret </code> or <code> passphrase</code>. </li>
<li> Re-create the <code> cryptdev </code> device, but this time use
a different (hence invalid) key. Guess what happens and confirm. </li>

<li> Write a script which disables echoing (<code>stty -echo</code>),
reads a passphrase from stdin and combines the above steps to create
and mount an encrypted device. </li>
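The steps above can be combined into a small sketch. Hashing the
passphrase with SHA-256 is one simple way to derive the key (its 64
hex digits encode exactly 256 bits); the exercises do not prescribe
this particular method, and setting up the device requires root:

```shell
# Derive a 256 bit key from a passphrase: print the 64 hex digits
# of its SHA-256 hash, which is the format the dm-crypt table expects.
passphrase_to_key() {
	printf '%s' "$1" | sha256sum | cut -d ' ' -f 1
}

# Read a passphrase without echoing it and set up the encrypted
# device /dev/mapper/cryptdev on top of /dev/loop0.
create_cryptdev() {
	stty -echo; read -r pass; stty echo
	sectors=$(blockdev --getsz /dev/loop0)
	echo "0 $sectors crypt aes $(passphrase_to_key "$pass") 0 /dev/loop0 0" |
		dmsetup create cryptdev
}
```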
Why is it a good idea to overwrite a block device with random data
before it is encrypted?

The dm-crypt target encrypts whole block devices. An alternative is
to encrypt on the file system level. That is, each file is encrypted
separately. Discuss the pros and cons of both approaches.
SUBSECTION(«Random stream»)
/* Link with -lcrypto */
#include <openssl/rand.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned char buf[1024 * 1024];

	for (;;) {
		int ret = RAND_bytes(buf, sizeof(buf));
		if (ret <= 0) {
			fprintf(stderr, "RAND_bytes() error\n");
			exit(EXIT_FAILURE);
		}
		ret = write(STDOUT_FILENO, buf, sizeof(buf));
		if (ret < 0)
			exit(EXIT_FAILURE);
	}
}