LVM.m4

   1 TITLE(«
   2
   3         Who the heck is General Failure, and why is he reading my disk? -- Unknown
   4
   5 », __file__)
   6
   7 OVERVIEW(«
   8
   9 The idea of Logical Volume Management is to decouple data and
  10 storage. This offers great flexibility in managing storage and reduces
  11 server downtimes because the storage may be replaced while file
  12 systems are mounted read-write and applications are actively using
  13 them. This chapter provides an introduction to the Linux block layer
  14 and LVM. Subsequent sections cover selected device mapper targets.
  15
  16 »)
  17
  18 SECTION(«The Linux Block Layer»)
  19
  20 <p> The main task of LVM is the management of block devices, so it is
  21 natural to start an introduction to LVM with a section on the Linux
  22 block layer, which is the central component in the Linux kernel
  23 for the handling of persistent storage devices. The mission of the
  24 block layer is to provide a uniform interface to different types
  25 of storage devices. The obvious in-kernel users of this interface
  26 are the file systems and the swap subsystem. But also <em> stacking
  27 device drivers </em> like LVM, Bcache and MD access block devices
  28 through this interface to create virtual block devices from other block
  29 devices. Some user space programs (<code>fdisk, dd, mkfs, ...</code>)
  30 also need to access block devices. The block layer allows them to
  31 perform their task in a well-defined and uniform manner through
  32 block-special device files. </p>
  33
  34 <p> The userspace programs and the in-kernel users interact with the block
  35 layer by sending read or write requests. A <em>bio</em> is the central
  36 data structure that carries such requests within the kernel. Bios
  37 may contain an arbitrary amount of data. They are given to the block
  38 layer to be queued for subsequent handling. Often a bio has to travel
  39 through a stack of block device drivers where each driver modifies
  40 the bio and sends it on to the next driver. Typically, only the last
  41 driver in the stack corresponds to a hardware device. </p>
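
<p> The journey of a bio through a driver stack can be illustrated with a
toy model. The following Python sketch is not kernel code. The class
names and the idea of a driver which merely shifts sector numbers are
assumptions made for illustration. Each stacking driver modifies the
bio and hands it to the driver below, and only the bottom of the stack
stores data. </p>

```python
class Bio:
    """Toy write request: a target sector plus the data to store."""

    def __init__(self, sector, data):
        self.sector = sector
        self.data = data


class Disk:
    """Bottom of the stack: stands in for a hardware device."""

    def __init__(self):
        self.blocks = {}

    def submit(self, bio):
        self.blocks[bio.sector] = bio.data


class OffsetDriver:
    """Hypothetical stacking driver which remaps sectors by a fixed offset."""

    def __init__(self, lower, offset):
        self.lower = lower
        self.offset = offset

    def submit(self, bio):
        bio.sector += self.offset   # modify the bio ...
        self.lower.submit(bio)      # ... and send it on to the next driver


disk = Disk()
lower_layer = OffsetDriver(disk, offset=2048)
upper_layer = OffsetDriver(lower_layer, offset=100)
upper_layer.submit(Bio(sector=7, data="hello"))
print(disk.blocks)
```

<p> The write to sector 7 of the topmost device ends up in sector 2155
of the toy disk because both offsets are applied on the way down. </p>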
  42
  43 <p> Besides requests to read or write data blocks, there are various other
  44 bio requests that carry SCSI commands like FLUSH, FUA (Force Unit
  45 Access), TRIM and UNMAP. FLUSH and FUA ensure that certain data hits
  46 stable storage. FLUSH asks the device to write out the contents of
  47 its volatile write cache while a FUA request carries data that should
  48 be written directly to the device, bypassing all caches. UNMAP/TRIM is
  49 a SCSI/ATA command which is only relevant to SSDs. It is a promise of
  50 the OS to not read the given range of blocks any more, so the device
  51 is free to discard the contents and return arbitrary data on the
  52 next read. This helps the device to level out the number of times
  53 the flash storage cells are overwritten (<em>wear-leveling</em>),
  54 which improves the durability of the device. </p>
  55
  56 <p> The first task of the block layer is to split incoming bios if
  57 necessary to make them conform to the size limit or the alignment
  58 requirements of the target device, and to batch and merge bios so that
  59 they can be submitted as a unit for performance reasons. The bios
  60 processed this way then form an I/O request which is handed to an <em>
  61 I/O scheduler </em> (also known as <em> elevator</em>). </p>
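
<p> The merge step can be sketched in a few lines. The following Python
model is a simplification under stated assumptions, not the kernel's
algorithm: a request is a hypothetical (start sector, sector count)
pair, and the function <code>merge_requests</code> is a name invented
for this example. It coalesces requests which are contiguous or
overlap so that they can be submitted as a unit. </p>

```python
def merge_requests(requests):
    """Coalesce (start_sector, sector_count) pairs which touch or overlap."""
    merged = []
    for start, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] >= start:
            # The request begins where the previous one ends (or earlier):
            # extend the previous request instead of queueing a new one.
            prev_start, prev_count = merged[-1]
            end = max(prev_start + prev_count, start + count)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, count))
    return merged


# Three of the four requests are contiguous and collapse into one.
print(merge_requests([(8, 8), (0, 8), (100, 4), (16, 8)]))
```

<p> For the input above the sketch yields two requests: one covering
sectors 0 to 23 and the unrelated one at sector 100. </p>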
  62
  63 <p> Traditionally, the schedulers were designed for rotating disks.
  64 They implemented a single request queue and reordered the queued
  65 I/O requests with the aim to minimize disk seek times. The newer
  66 multi-queue schedulers mq-deadline, kyber, and bfq (budget fair
  67 queueing) aim to max out even the fastest devices. As implied by
  68 the name "multi-queue", they implement several request queues,
  69 the number of which depends on the hardware in use. This has become
  70 necessary because modern storage hardware allows multiple requests
  71 to be submitted in parallel from different CPUs. Moreover, with many
  72 CPUs the locking overhead required to put a request into a queue
  73 increases. Per-CPU queues allow for per-CPU locks, which decreases
  74 queue lock contention. </p>
  75
  76 <p> We will take a look at some aspects of the Linux block layer and at
  77 the various I/O schedulers. An exercise on loop devices enables the
  78 reader to create block devices for testing. This will be handy in
  79 the subsequent sections on LVM specific topics. </p>
  80
  81 EXERCISES()
  82
  83 <ul>
  84
  85         <li> Run <code>find /dev -type b</code> to get the list of all block
  86         devices on your system. Explain which is which. </li>
  87
  88         <li> Examine the files in <code>/sys/block/sda</code>, in
  89         particular <code>/sys/block/sda/stat</code>. Search the web for
  90         <code>Documentation/block/stat.txt</code> for the meaning of the
  91         numbers shown. Then run <code>iostat -xdh sda 1</code>. </li>
  92
  93         <li> Examine the files in <code>/sys/block/sda/queue</code>. </li>
  94
  95         <li> Find out how to determine the size of a block device. </li>
  96
  97         <li> Figure out a way to identify the name of all block devices which
  98         correspond to SSDs (i.e., excluding any rotating disks). </li>
  99
 100         <li> Run <code>lsblk</code> and discuss
 101         the output. Too easy? Run <code>lsblk -o
 102         KNAME,PHY-SEC,MIN-IO,OPT-IO,LOG-SEC,RQ-SIZE,ROTA,SCHED</code>
 103         </li>
 104
 105         <li> What's the difference between a task scheduler and an I/O
 106         scheduler? </li>
 107
 108         <li> Why are I/O schedulers also called elevators? </li>
 109
 110         <li> How can one find out which I/O schedulers are supported on a
 111         system and which scheduler is active for a given block device? </li>
 112
 113         <li> Is it possible (and safe) to change the I/O scheduler for a
 114         block device while it is in use? If so, how can this be done? </li>
 115
 116         <li> The loop device driver of the Linux kernel allows privileged
 117         users to create a block device from a regular file stored on a file
 118         system.  The resulting block device is called a <em>loop</em> device.
 119         Create a 1G large temporary file containing only zeroes. Run a suitable
 120         <code>losetup(8)</code> command to create a loop device from the
 121         file. Create an XFS file system on the loop device and mount it. </li>
 122
 123 </ul>
 124
 125 HOMEWORK(«
 126
 127 <ul>
 128         <li> Come up with three different use cases for loop devices. </li>
 129
 130         <li> Given a block device node in <code> /dev</code>, how can one
 131         tell that it is a loop device? </li>
 132
 133         <li> Describe the connection between loop devices created by
 134         <code>losetup(8)</code> and the loopback device used for network
 135         connections from the machine to itself. </li>
 136
 137 </ul>
 138 »)
 139
 140 define(«svg_disk», «
 141         <g
 142                 fill="$5"
 143                 stroke="black"
 144                 stroke-width="1"
 145         >
 146         <ellipse
 147                 cx="eval($1 + $3 / 2)"
 148                 cy="eval($2 + $4)"
 149                 rx="eval($3 / 2)"
 150                 ry="eval($3 / 4)"
 151         />
 152         <rect
 153                 x="$1"
 154                 y="$2"
 155                 width="$3"
 156                 height="$4"
 157         />
 158         <rect
 159                 x="eval($1 + 1)"
 160                 y="eval($2 + $4 - 1)"
 161                 width="eval($3 - 2)"
 162                 height="2"
 163                 stroke="$5"
 164         />
 165         <ellipse
 166                 cx="eval($1 + $3 / 2)"
 167                 cy="$2"
 168                 rx="eval($3 / 2)"
 169                 ry="eval($3 / 4)"
 170         />
 171         </g>
 172 »)
 173
 174 SECTION(«Physical and Logical Volumes, Volume Groups»)
 175
 176 <p> Getting started with the Logical Volume Manager (LVM) requires
 177 getting used to a minimal set of vocabulary. This section introduces
 178 the words named in the title of the section, and a couple more.
 179 The basic concepts of LVM are then described in terms of these words. </p>
 180
 181 <div>
 182 define(«lvm_width», «300»)
 183 define(«lvm_height», «183»)
 184 define(«lvm_margin», «10»)
 185 define(«lvm_extent_size», «10»)
 186 define(«lvm_extent», «
 187         <rect
 188                 fill="$1"
 189                 x="$2"
 190                 y="$3"
 191                 width="lvm_extent_size()"
 192                 height="lvm_extent_size()"
 193                 stroke="black"
 194                 stroke-width="1"
 195         />
 196 »)
 197 dnl $1: color, $2: x, $3: y, $4: number of extents
 198 define(«lvm_extents», «
 199         ifelse(«$4», «0», «», «
 200                 lvm_extent(«$1», «$2», «$3»)
 201                 lvm_extents(«$1», eval($2 + lvm_extent_size() + lvm_margin()),
 202                         «$3», eval($4 - 1))
 203         »)
 204 »)
 205 dnl $1: x, $2: y, $3: number of extents, $4: disk color, $5: extent color
 206 define(«lvm_disk», «
 207         ifelse(eval(«$3» > 3), «1», «
 208                 pushdef(«h», «eval(7 * lvm_extent_size())»)
 209                 pushdef(«w», «eval(($3 + 1) * lvm_extent_size())»)
 210         », «
 211                 pushdef(«h», «eval(3 * lvm_extent_size() + lvm_margin())»)
 212                 pushdef(«w», «eval($3 * lvm_extent_size() * 2)»)
 213         »)
 214         svg_disk(«$1», «$2», «w()», «h()», «$4»)
 215         ifelse(eval(«$3» > 3), «1», «
 216                 pushdef(«n1», eval(«$3» / 2))
 217                 pushdef(«n2», eval(«$3» - n1()))
 218                 lvm_extents(«$5»,
 219                         eval(«$1» + (w() - (2 * n1() - 1) * lvm_extent_size()) / 2),
 220                         eval(«$2» + h() / 2 - lvm_extent_size()), «n1()»)
 221                 lvm_extents(«$5»,
 222                         eval(«$1» + (w() - (2 * n2() - 1) * lvm_extent_size()) / 2),
 223                         eval(«$2» + h() / 2 + 2 * lvm_extent_size()), «n2()»)
 224                 popdef(«n1»)
 225                 popdef(«n2»)
 226         », «
 227                 lvm_extents(«$5»,
 228                         eval(«$1» + (w() - (2 * «$3» - 1) * lvm_extent_size()) / 2),
 229                         eval(«$2» + h() / 2), «$3»)
 230         »)
 231         popdef(«w»)
 232         popdef(«h»)
 233 »)
 234 <svg
 235         width="lvm_width()" height="lvm_height()"
 236         xmlns="http://www.w3.org/2000/svg"
 237         xmlns:xlink="http://www.w3.org/1999/xlink"
 238 >
 239         <rect
 240                 x="1"
 241                 y="1"
 242                 width="140"
 243                 height="180"
 244                 fill="green"
 245                 rx="10"
 246                 stroke-width="1"
 247                 stroke="black"
 248         />
 249         lvm_disk(«20», «20», «2», «#666», «yellow»)
 250         lvm_disk(«10», «90», «4», «#666», «yellow»)
 251         lvm_disk(«70», «55», «5», «#666», «yellow»)
 252         <path
 253                 d="
 254                         M 155 91
 255                         l 20 0
 256                         m 0 0
 257                         l -4 -3
 258                         l 0 6
 259                         l 4 -3
 260                         z
 261                 "
 262                 stroke-width="4"
 263                 stroke="black"
 264                 fill="black"
 265         />
 266         lvm_disk(«190», «22», «7», «#66f», «orange»)
 267         lvm_disk(«220», «130», «1», «#66f», «orange»)
 268 </svg>
 269 </div>
 270
 271 <p> A <em> Physical Volume</em> (PV, grey) is an arbitrary block device which
 272 contains a certain metadata header (also known as <em>superblock</em>)
 273 at the start. PVs can be partitions on a local hard disk or an SSD,
 274 a soft- or hardware raid, or a loop device. LVM does not care.
 275 The storage space on a physical volume is managed in units called <em>
 276 Physical Extents </em> (PEs, yellow). The default PE size is 4M. </p>
 277
 278 <p> A <em>Volume Group</em> (VG, green) is a non-empty set of PVs with
 279 a name and a unique ID assigned to it. A PV can but doesn't need to
 280 be assigned to a VG. If it is, the ID of the associated VG is stored
 281 in the metadata header of the PV. </p>
 282
 283 <p> A <em> Logical Volume</em> (LV, blue) is a named block device which is
 284 provided by LVM. LVs are always associated with a VG and are stored
 285 on that VG's PVs. Since LVs are normal block devices, file systems
 286 of any type can be created on them, they can be used as swap storage,
 287 etc. The chunks of a LV are managed as <em>Logical Extents</em> (LEs,
 288 orange). Often the LE size equals the PE size.  For each LV there is
 289 a mapping between the LEs of the LV and the PEs of the underlying
 290 PVs. The PEs can be spread across multiple PVs. </p>
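
<p> The mapping between LEs and PEs can be pictured as a small table of
segments. The following Python sketch assumes the default PE size of
4M and a made-up layout for a 3G LV whose extents are spread over two
PVs. The segment table and the function <code>locate</code> are
hypothetical and do not reflect the on-disk format of the LVM
metadata. </p>

```python
PE_SIZE = 4 * 1024 * 1024  # default extent size of 4M

# Hypothetical mapping of a 3G LV (768 LEs). Each segment is
# (first LE, number of LEs, PV name, first PE on that PV).
SEGMENTS = [
    (0, 512, "/dev/loop1", 0),
    (512, 256, "/dev/loop2", 0),
]


def locate(byte_offset, segments=SEGMENTS):
    """Translate a byte offset within the LV to (PV, PE, offset within the PE)."""
    le, within = divmod(byte_offset, PE_SIZE)
    for first_le, count, pv, first_pe in segments:
        if first_le + count > le >= first_le:
            return pv, first_pe + (le - first_le), within
    raise ValueError("offset beyond the end of the LV")


print(locate(0))                   # first LE, stored on /dev/loop1
print(locate(2 * 1024 ** 3 + 5))   # LE 512, the first one on /dev/loop2
```

<p> Moving PEs from one PV to another then amounts to copying the data
and updating the segment table, which is why it can happen while the
LV is in use. </p>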
 291
 292 <p> A VG can be extended by adding additional PVs to it, or reduced by
 293 removing unused devices, i.e., those with no PEs allocated on them. PEs
 294 may be moved from one PV to another while the LVs are active. LVs
 295 may be grown or shrunk. To grow a LV, there must be enough space
 296 left in the VG. Growing a LV does not magically grow the file system
 297 stored on it, however. To make use of the additional space, a second,
 298 file system specific step is needed to tell the file system that its
 299 underlying block device (the LV) has grown. </p>
 300
 301 <p> The exercises of this section illustrate the basic LVM concepts
 302 and the essential LVM commands. They ask the reader to create a VG
 303 whose PVs are loop devices. This VG is used as a starting point in
 304 subsequent chapters. </p>
 305
 306 EXERCISES()
 307
 308 <ul>
 309
 310         <li> Create two 5G large loop devices <code>/dev/loop1</code>
 311         and <code>/dev/loop2</code>. Make them PVs by running
 312         <code>pvcreate</code>. Create a VG <code>tvg</code> (test volume group)
 313         from the two loop devices and two 3G large LVs named <code>tlv1</code>
 314         and <code>tlv2</code> on it. Run the <code>pvcreate, vgcreate</code>,
 315         and <code>lvcreate</code> commands with <code>-v</code> to activate
 316         verbose output and try to understand each output line. </li>
 317
 318         <li> Run <code>pvs, vgs, lvs, lvdisplay, pvdisplay</code> and examine
 319         the output. </li>
 320
 321         <li> Run <code>lvdisplay -m</code> to examine the mapping of logical
 322         extents to PVs and physical extents. </li>
 323
 324         <li> Run <code>pvs --segments -o+lv_name,seg_start_pe,segtype</code>
 325         to see the map between physical extents and logical extents. </li>
 326
 327 </ul>
 328
 329 HOMEWORK(«
 330
 331 In the above scenario (two LVs in a VG consisting of two PVs), how
 332 can you tell whether both PVs are actually used? Remove the LVs
 333 with <code>lvremove</code>. Recreate them, but this time use the
 334 <code>--stripes 2</code> option to <code>lvcreate</code>. Explain
 335 what this option does and confirm with a suitable command.
 336
 337 »)
 338
 339 SECTION(«Device Mapper and Device Mapper Targets»)
 340
 341 <p> The kernel part of the Logical Volume Manager (LVM) is called
 342 <em>device mapper</em> (DM), which is a generic framework to map
 343 one block device to another. Applications talk to the Device Mapper
 344 via the <em>libdevmapper</em> library, which issues requests
 345 to the <code>/dev/mapper/control</code> character device using the
 346 <code>ioctl(2)</code> system call. The device mapper is also accessible
 347 from scripts via the <code>dmsetup(8)</code> tool. </p>
 348
 349 <p> A DM target represents one particular mapping type for ranges
 350 of LEs. Several DM targets exist, each of which creates and
 351 maintains block devices with certain characteristics. In this section
 352 we take a look at the <code>dmsetup</code> tool and the relatively
 353 simple <em>mirror</em> target. Subsequent sections cover other targets
 354 in more detail. </p>
 355
 356 EXERCISES()
 357
 358 <ul>
 359
 360         <li> Run <code>dmsetup targets</code> to list all targets supported
 361         by the currently running kernel. Explain their purpose and typical
 362         use cases. </li>
 363
 364         <li> Starting with the <code>tvg</code> VG, remove <code>tlv2</code>.
 365         Convince yourself by running <code>vgs</code> that <code>tvg</code>
 366         is 10G large, with 3G being in use. Run <code>pvmove
 367         /dev/loop1</code> to move the used PEs of <code>/dev/loop1</code>
 368         to <code>/dev/loop2</code>. After the command completes, run
 369         <code>pvs</code> again to see that <code>/dev/loop1</code> has no
 370         more PEs in use. </li>
 371
 372         <li> Create a third 5G loop device <code>/dev/loop3</code>, make it a
 373         PV and extend the VG with <code>vgextend tvg /dev/loop3</code>. Now
 374         the LEs of the remaining LV <code>tlv1</code> fit on any
 375         of the three PVs.  Come up with a command which moves them to
 376         <code>/dev/loop3</code>. </li>
 377
 378         <li> The first two loop devices are both unused. Remove them from
 379         the VG with <code>vgreduce -a</code>. Why are they still listed in
 380         the <code>pvs</code> output? What can be done about that? </li>
 381
 382 </ul>
 383
 384 HOMEWORK(«
 385
 386 As advertised in the introduction, LVM allows the administrator to
 387 replace the underlying storage of a file system online. This is done
 388 by running a suitable <code>pvmove(8)</code> command to move all PEs of
 389 one PV to different PVs in the same VG.
 390
 391 <ul>
 392
 393         <li> Explain the mapping type of dm-mirror. </li>
 394
 395         <li> The traditional way to mirror the contents of two or more block
 396         devices is software raid 1, also known as <em>md raid1</em> ("md"
 397         is short for multi-disk). Explain the difference between md raid1,
 398         the dm-raid target which supports raid1 and other raid levels, and
 399         the dm-mirror target. </li>
 400
 401         <li> Guess how <code>pvmove</code> is implemented on top of
 402         dm-mirror. Verify your guess by reading the "NOTES" section of the
 403         <code>pvmove(8)</code> man page. </li>
 404
 405 </ul>
 406 »)
 407
 408 SECTION(«LVM Snapshots»)
 409
 410 <p> LVM snapshots are based on the CoW optimization strategy described
 411 earlier in the chapter on <a href="./Unix_Concepts.html#processes">Unix
 412 Concepts</a>. Creating a snapshot means creating a CoW table of the
 413 given size. Just before a LE of a snapshotted LV is about to be written
 414 to, its contents are copied to a free slot in the CoW table. This
 415 preserves an old version of the LV, the snapshot, which can later be
 416 reconstructed by overlaying the CoW table atop the LV. </p>
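
<p> The CoW mechanism described above can be modelled in a few lines of
Python. This is a minimal sketch under simplifying assumptions: the
class name <code>SnapshottedLV</code> is invented, extents are
plain list entries, and the limited size of the CoW table is ignored. </p>

```python
class SnapshottedLV:
    """Toy origin LV with a single snapshot backed by a CoW table."""

    def __init__(self, extents):
        self.origin = list(extents)   # current contents, one entry per LE
        self.cow = {}                 # LE number -> contents at snapshot time

    def write(self, le, data):
        # Just before an LE is overwritten for the first time, its old
        # contents are copied to the CoW table.
        if le not in self.cow:
            self.cow[le] = self.origin[le]
        self.origin[le] = data

    def read_snapshot(self, le):
        # Reconstruct the old version by overlaying the CoW table atop the LV.
        return self.cow.get(le, self.origin[le])


lv = SnapshottedLV(["a", "b", "c"])
lv.write(1, "B")
print(lv.origin)
print([lv.read_snapshot(le) for le in range(3)])
```

<p> After the write, the origin holds the new contents while the
snapshot view still returns the three original extents. </p>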
 417
 418 <p> Snapshots can be taken from a LV which contains a mounted file system,
 419 while applications are actively modifying files. Without coordination
 420 between the file system and LVM, the file system most likely has memory
 421 buffers scheduled for writeback. These outstanding writes did not make
 422 it to the snapshot, so one cannot expect the snapshot to contain a
 423 consistent file system image. Instead, it is in a state similar to that of a
 424 regular device after an unclean shutdown. This is not a problem for
 425 XFS and EXT4, as both are <em>journalling</em> file systems, which
 426 were designed with crash recovery in mind. At the next mount after a
 427 crash, journalling file systems replay their journal, which results
 428 in a consistent state. Note that this implies that even a read-only
 429 mount of the snapshot device has to write to the device. </p>
 430
 431 EXERCISES()
 432
 433 <ul>
 434
 435         <li> In the test VG, create a 1G large snapshot named
 436         <code>snap_tlv1</code> of the <code>tlv1</code> LV by using the
 437         <code>-s</code> option to <code>lvcreate(8)</code>. Predict how much
 438         free space is left in the VG. Confirm with <code>vgs tvg</code>. </li>
 439
 440         <li> Create an EXT4 file system on <code>tlv1</code> by running
 441         <code>mkfs.ext4 /dev/tvg/tlv1</code>. Guess how much of the snapshot
 442         space has been allocated by this operation. Check with <code>lvs
 443         tvg/snap_tlv1</code>. </li>
 444
 445         <li> Remove the snapshot with <code>lvremove</code> and recreate
 446         it. Repeat the previous step, but this time run <code>mkfs.xfs</code>
 447         to create an XFS file system. Run <code>lvs tvg/snap_tlv1</code>
 448         again and compare the used snapshot space to the EXT4 case. Explain
 449         the difference. </li>
 450
 451         <li> Remove the snapshot and recreate it so that both <code>tlv1</code>
 452         and <code>snap_tlv1</code> contain a valid XFS file system. Mount
 453         the file systems on <code>/mnt/1</code> and <code>/mnt/2</code>. </li>
 454
 455         <li> Run <code>dd if=/dev/zero of=/mnt/1/zero count=$((2 * 100 *
 456         1024))</code> to create a 100M large file on <code>tlv1</code>. Check
 457         that <code>/mnt/2</code> is still empty. Estimate how much of the
 458         snapshot space is used and check again. </li>
 459
 460         <li> Repeat the above <code>dd</code> command 5 times and run
 461         <code>lvs</code> again. Explain why the used snapshot space did not
 462         increase. </li>
 463
 464         <li> It is possible to create snapshots of snapshots. This is
 465         implemented by chaining together CoW tables. Describe the impact on
 466         performance. </li>
 467
 468         <li> Suppose a snapshot was created before significant modifications
 469         were made to the contents of the LV, for example an upgrade of a large
 470         software package. Assume that the user wishes to permanently return to
 471         the old version because the upgrade did not work out. In this scenario
 472         it is the snapshot which needs to be retained, rather than the original
 473         LV.  In view of this scenario, guess what happens on the attempt to
 474         remove a LV which is being snapshotted. Unmount <code>/mnt/1</code>
 475         and confirm by running <code>lvremove tvg/tlv1</code>. </li>
 476
 477         <li> Come up with a suitable <code>lvconvert</code> command which
 478         replaces the role of the LV and its snapshot. Explain why this solves
 479         the "bad upgrade" problem outlined above. </li>
 480
 481         <li> Explain what happens if the CoW table fills up. Confirm by
 482         writing a file larger than the snapshot size. </li>
 483
 484 </ul>
 485
 486 SECTION(«Thin Provisioning»)
 487
 488 <p> The term "thin provisioning" is just a modern buzzword for
 489 over-subscription. Both terms mean to give the appearance of having
 490 more resources than are actually available. This is achieved by
 491 on-demand allocation. Linux implements thin provisioning
 492 as a DM target called <em>dm-thin</em>. This code
 493 first made its appearance in 2011 and was declared as stable two
 494 years later. These days it should be safe for production use. </p>
 495
 496 <p> The general problem with thin provisioning is of course that bad
 497 things happen when the resources are exhausted because the demand has
 498 increased before new resources were added. For dm-thin this can happen
 499 when users write to their allotted space, causing dm-thin to attempt
 500 allocating a data block from a volume which is already full. This
 501 usually leads to severe data corruption because file systems are
 502 not really prepared to handle this error case and treat it as if the
 503 underlying block device had failed. dm-thin does nothing to prevent
 504 this, but one can configure a <em>low watermark</em>.  When the
 505 number of free data blocks drops below the watermark, a so-called
 506 <em>dm-event</em> will be generated to notify the administrator. </p>
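
<p> On-demand allocation and the low watermark can be illustrated with
a toy model. The following Python sketch makes several assumptions:
the classes <code>ThinPool</code> and <code>ThinVolume</code> are
invented for this example, the list <code>events</code> merely stands
in for a dm-event, and pool exhaustion is reduced to a plain I/O
error. It is not a description of how dm-thin is implemented. </p>

```python
class ThinPool:
    """Toy pool of data blocks which are handed out on demand."""

    def __init__(self, data_blocks, low_watermark):
        self.free = data_blocks
        self.low_watermark = low_watermark
        self.events = []              # stands in for dm-event notifications

    def allocate(self):
        if self.free == 0:
            raise OSError("thin pool out of data blocks")
        self.free -= 1
        # Notify once when the number of free blocks drops below the watermark.
        if self.low_watermark > self.free and not self.events:
            self.events.append("low watermark reached")


class ThinVolume:
    """Toy thin volume: blocks are provisioned on the first write only."""

    def __init__(self, pool, virtual_blocks):
        self.pool = pool
        self.virtual_blocks = virtual_blocks
        self.mapped = set()           # virtual blocks backed by pool blocks

    def write(self, block):
        if block >= self.virtual_blocks:
            raise ValueError("write beyond the end of the volume")
        if block not in self.mapped:
            self.pool.allocate()
            self.mapped.add(block)


pool = ThinPool(data_blocks=4, low_watermark=2)
vol = ThinVolume(pool, virtual_blocks=10)   # over-subscribed: 10 blocks promised, 4 available
for block in (0, 1, 2):
    vol.write(block)
print(pool.free, pool.events)
```

<p> After three writes only one block is free, so the notification has
been recorded. Two more writes to fresh blocks would exhaust the pool,
and the second of them would fail although the volume claims to have
room for ten blocks. </p>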
 507
 508 <p> One highlight of dm-thin is its efficient support for an arbitrary
 509 depth of recursive snapshots, called <em>dm-thin snapshots</em>
 510 in this document. With the traditional snapshot implementation,
 511 recursive snapshots quickly become a performance issue as the depth
 512 increases. With dm-thin one can have an arbitrary subset of all
 513 snapshots active at any point in time, and there is no ordering
 514 requirement on activating or removing them. </p>
 515
 516 <p> The block devices created by dm-thin always belong to a <em>thin
 517 pool</em> which ties together two LVs called the <em>metadata LV</em>
 518 and the <em>data LV</em>. The combined LV is called the <em>thin pool
 519 LV</em>. Setting up a VG for thin provisioning is done in two steps:
 520 First the standard LVs for data and the metadata are created. Second,
 521 the two LVs are combined into a thin pool LV. The second step hides
 522 the two underlying LVs so that only the combined thin pool LV is
 523 visible afterwards. Thin provisioned LVs and dm-thin snapshots can
 524 then be created from the thin pool LV with a single command. </p>
 525
 526 <p> Another nice feature of dm-thin is <em>external snapshots</em>.
 527 An external snapshot is one where the origin for a thinly provisioned
 528 device is not a device of the pool. Arbitrary read-only block
 529 devices can be turned into writable devices by creating an external
 530 snapshot. Reads to an unprovisioned area of the snapshot will be passed
 531 through to the origin. Writes trigger the allocation of new blocks as
 532 usual with CoW. One use case for this is VM hosts which run their VMs
 533 on thinly-provisioned volumes but have the base image on some "master"
 534 device which is read-only and can hence be shared between all VMs. </p>
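
<p> The read and write paths of an external snapshot can be sketched as
follows. The Python class <code>ExternalSnapshot</code> and the tuple
which plays the role of the read-only master device are hypothetical.
The sketch only shows the pass-through logic and ignores how blocks
are allocated from the pool. </p>

```python
class ExternalSnapshot:
    """Toy writable view of a read-only origin device."""

    def __init__(self, origin):
        self.origin = origin          # shared and never modified
        self.provisioned = {}         # block number -> privately written data

    def read(self, block):
        # Reads of unprovisioned areas are passed through to the origin.
        return self.provisioned.get(block, self.origin[block])

    def write(self, block, data):
        # A write provisions a private block; the origin stays untouched.
        self.provisioned[block] = data


base_image = ("boot", "root", "home")   # hypothetical read-only master
vm1 = ExternalSnapshot(base_image)
vm2 = ExternalSnapshot(base_image)
vm1.write(1, "root-vm1")
print(vm1.read(1), vm2.read(1), base_image[1])
```

<p> Only the first VM sees its own modification. The second VM and the
master device still return the original contents, which is what makes
it possible to share the base image between all VMs. </p>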
 535
 536 EXERCISES()
 537
 538 <p> Starting with the <code>tvg</code> VG, create and test a thin pool LV
 539 by performing the following steps.  The "Thin Usage" section of
 540 <code>lvmthin(7)</code> will be helpful. </p>
 541
 542 <ul>
 543
 544         <li> Remove the <code>tlv1</code> and <code>tlv2</code> LVs. </li>
 545
 546         <li> Create a 5G data LV named <code>tdlv</code> (thin data LV)
 547         and a 500M LV named <code>tmdlv</code> (thin metadata LV). </li>
 548
 549         <li> Combine the two LVs into a thin pool with
 550         <code>lvconvert</code>. Run <code>lvs -a</code> and explain the flags
 551         listed below <code>Attr</code>. </li>
 552
 553         <li> Create a 10G thin LV named <code>oslv</code> (over-subscribed
 554         LV). </li>
 555
 556         <li> Create an XFS file system on <code>oslv</code> and mount it on
 557         <code>/mnt</code>. </li>
 558
 559         <li> Run a loop of the form <code>for ((i = 0; i &lt; 50; i++)); do
 560         ... ; done</code> so that each iteration creates a 50M file named
 561         <code>file-$i</code> and a snapshot named <code>snap_oslv-$i</code>
 562         of <code>oslv</code>. </li>
 563
 564         <li> Activate an arbitrary snapshot with <code>lvchange -K</code> and
 565         try to mount it. Explain what the error message means. Then read the
 566         "XFS on snapshots" section of <code>lvmthin(7)</code>. </li>
 567
 568         <li> Check the available space of the data LV with <code>lvs
 569         -a</code>. Mount one snapshot (specifying <code>-o nouuid</code>)
 570         and run <code>lvs -a</code> again.  Why did the free space decrease
 571         although no new files were written? </li>
 572
 573         <li> Mount four different snapshots and check that they contain the
 574         expected files. </li>
 575
 576         <li> Remove all snapshots. Guess what <code>lvs -a</code> and <code>df
 577         -h /mnt</code> report. Then run the commands to confirm. Guess
 578         what happens if you try to create another 3G file? Confirm
 579         your guess, then read the section on "Data space exhaustion" of
 580         <code>lvmthin(7)</code>. </li>
 581
 582 </ul>
 583
 584 HOMEWORK(«
 585
 586 When a thin pool provisions a new data block for a thin LV, the new
 587 block is first overwritten with zeros by default. Discuss why this
 588 is done, its impact on performance and security, and conclude whether
 589 or not it is a good idea to turn off the zeroing.
 590
 591 »)
 592
 593 SECTION(«Bcache, dm-cache and dm-writecache»)
 594
 595 <p> All three implementations named in the title of this section are <em>
 596 Linux block layer caches</em>. They combine two different block
 597 devices to form a hybrid block device which dynamically caches
 598 and migrates data between the two devices with the aim to improve
 599 performance. One device, the <em> backing device</em>, is expected
 600 to be large and slow while the other one, the <em>cache device</em>,
 601 is expected to be small and fast. </p>
 602
 603 <div>
 604 define(«bch_width», «300»)
 605 define(«bch_height», «130»)
 606 define(«bch_margin», «10»)
 607 define(«bch_rraid_width», «eval((bch_width() - 4 * bch_margin()) * 4 / 5)»)
 608 define(«bch_raidbox_height», «eval(bch_height() - 2 * bch_margin())»)
 609 define(«bch_nraid_width», «eval(bch_rraid_width() / 4)»)
 610 define(«bch_rdisk_width», «eval((bch_width() - 3 * bch_margin()) * 18 / 100)»)
 611 define(«bch_rdisk_height», «eval((bch_height() - 4 * bch_margin()) / 3)»)
 612 define(«bch_ndisk_width», «eval(bch_rdisk_width() / 2)»)
 613 define(«bch_ndisk_height», «eval(bch_raidbox_height() - 5 * bch_margin())»)
 614 define(«bch_rdisk», «svg_disk(«$1», «$2»,
 615         «bch_rdisk_width()», «bch_rdisk_height()», «#666»)»)
 616 define(«bch_ndisk», «svg_disk(«$1», «$2»,
 617         «bch_ndisk_width()», «bch_ndisk_height()», «#66f»)»)
 618 define(«bch_5rdisk», «
 619         bch_rdisk(«$1», «$2»)
 620         bch_rdisk(«eval($1 + bch_margin())»,
 621                 «eval($2 + bch_margin())»)
 622         bch_rdisk(«eval($1 + 2 * bch_margin())»,
 623                 «eval($2 + 2 * bch_margin())»)
 624         bch_rdisk(«eval($1 + 3 * bch_margin())»,
 625                 «eval($2 + 3 * bch_margin())»)
 626         bch_rdisk(«eval($1 + 4 * bch_margin())»,
 627                 «eval($2 + 4 * bch_margin())»)
 628
 629 »)
 630 define(«bch_rraid», «
 631         <rect
 632                 fill="#3b3"
 633                 stroke="black"
 634                 x="$1"
 635                 y="$2"
 636                 width="bch_rraid_width()"
 637                 height="bch_raidbox_height()"
 638                 rx="10"
 639         />
 640         bch_5rdisk(«eval($1 + bch_margin())»,
 641                 «eval($2 + 2 * bch_margin())»)
 642         bch_5rdisk(«eval($1 + 2 * bch_rdisk_width() + bch_margin())»,
 643                 «eval($2 + 2 * bch_margin())»)
 644 »)
 645 define(«bch_nraid», «
 646         <rect
 647                 fill="orange"
 648                 stroke="black"
 649                 x="$1"
 650                 y="$2"
 651                 width="bch_nraid_width()"
 652                 height="bch_raidbox_height()"
 653                 rx="10"
 654         />
 655         bch_ndisk(eval($1 + bch_margin()),
 656                 eval($2 + 2 * bch_margin()))
 657         bch_ndisk(eval($1 + 2 * bch_margin()),
 658                 eval($2 + 3 * bch_margin()))
 659 »)
 660
 661 <svg
 662         width="bch_width()" height="bch_height()"
 663         xmlns="http://www.w3.org/2000/svg"
 664         xmlns:xlink="http://www.w3.org/1999/xlink"
 665 >
 666         <rect
 667                 fill="#cc2"
 668                 stroke="black"
 669                 stroke-width="1"
 670                 x="1"
 671                 y="1"
 672                 width="eval(bch_rraid_width() + bch_nraid_width()
 673                         + 3 * bch_margin() - 2)"
 674                 height="eval(bch_raidbox_height() + 2 * bch_margin() - 2)"
 675                 rx="10"
 676         />
 677         bch_nraid(«bch_margin()», «bch_margin()»)
 678         bch_rraid(«eval(2 * bch_margin() + bch_nraid_width())», «bch_margin()»)
 679 </svg>
 680 </div>
 681
<p> The simplest setup consists of a single rotating disk and one SSD.
The setup shown in the diagram on the left is realistic for a large
server with redundant storage. In this setup the hybrid device
(yellow) combines a raid6 array (green) consisting of many rotating
disks (grey) with a two-disk raid1 array (orange) stored on fast
NVMe devices (blue). In the simple setup it is always a win when
I/O is performed from/to the SSD instead of the rotating disk. In
the server setup, however, it depends on the workload which device
is faster. Given enough rotating disks and a streaming I/O workload,
the raid6 outperforms the raid1 because all of its disks can read or
write at full speed in parallel. </p>
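
<p> The break-even point can be estimated with a little arithmetic.
The shell sketch below assumes an idealized model in which a raid6
array of n disks streams at (n - 2) times the rate of a single disk,
since two disks' worth of bandwidth is spent on parity. The helper
names are ours and the throughput figures in the example are made up
for illustration; they are not measurements. </p>

```shell
# Idealized streaming throughput of a raid6 array in MB/s.
# Arguments: number of member disks, MB/s of a single disk.
raid6_stream_mbs() {
	echo $(( ($1 - 2) * $2 ))
}

# Smallest number of raid6 member disks whose idealized streaming
# rate exceeds the rate of the cache array.
# Arguments: MB/s of the cache array, MB/s of a single disk.
raid6_break_even_disks() {
	echo $(( $1 / $2 + 3 ))
}
```

<p> With a hypothetical 200 MB/s per rotating disk, <code>
raid6_stream_mbs 10 200 </code> prints 1600, and <code>
raid6_break_even_disks 3000 200 </code> prints 18, the number of
disks needed to exceed a raid1 array which is assumed to stream at
3000 MB/s. </p>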
 693
 694 <p> Since block layer caches hook into the Linux block API described <a
 695 href="«#»the_linux_block_layer">earlier</a>, the hybrid block devices
 696 they provide can be used like any other block device. In particular,
 697 the hybrid devices are <em> file system agnostic</em>, meaning that
 698 any file system can be created on them. In what follows we briefly
 699 describe the differences between the three block layer caches and
 700 conclude with the pros and cons of each. </p>
 701
 702 <p> Bcache is a stand-alone stacking device driver which was
 703 included in the Linux kernel in 2013. According to the <a
 704 href="https://bcache.evilpiepirate.org/">bcache home page</a>, it
 705 is "done and stable". dm-cache and dm-writecache are device mapper
 706 targets included in 2013 and 2018, respectively, which are both marked
 707 as experimental. In contrast to dm-cache, dm-writecache only caches
 708 writes while reads are supposed to be cached in RAM. It has been
 709 designed for programs like databases which need low commit latency.
 710 Both bcache and dm-cache can operate in writeback or writethrough
 711 mode while dm-writecache always operates in writeback mode. </p>
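
<p> For bcache, the caching mode of a device can be inspected and
changed at runtime through sysfs. The following shell sketch assumes
a bcache device named <code>bcache0</code> (a placeholder) and relies
on the fact that the <code>cache_mode</code> file lists all modes on
one line with the active one enclosed in brackets. The helper name
<code>active_cache_mode</code> is ours. </p>

```shell
# Print the active mode from the contents of a bcache cache_mode
# file such as "writethrough [writeback] writearound none".
# Returns 1 if the argument contains no bracketed word.
active_cache_mode() {
	mode=${1#*\[}		# strip everything up to the opening bracket
	[ "$mode" != "$1" ] || return 1
	echo "${mode%%\]*}"	# strip the closing bracket and the rest
}

# Hypothetical usage on a machine which has a bcache0 device:
#   active_cache_mode "$(cat /sys/block/bcache0/bcache/cache_mode)"
#   echo writeback | sudo tee /sys/block/bcache0/bcache/cache_mode
```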
 712
 713 <p> The DM-based caches are designed to leave the decision as to what
 714 data to migrate (and when) to user space while bcache has this policy
 715 built-in. However, at this point only the <em> Stochastic Multiqueue
 716 </em> (smq) policy for dm-cache exists, plus a second policy which
 717 is only useful for decommissioning the cache device. There are no
 718 tunables for dm-cache while all the bells and whistles of bcache can
 719 be configured through sysfs files.  Another difference is that bcache
 720 detects sequential I/O and separates it from random I/O so that large
 721 streaming reads and writes bypass the cache and don't push cached
 722 randomly accessed data out of the cache. </p>
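
<p> To make the role of the policy concrete, the shell sketch below
prints a table line for the dm-cache target in which the policy is
just one word. The field order follows the kernel's device mapper
cache documentation as far as we recall it, so check it against the
documentation of your kernel. The helper name, the device names and
the cache block size of 512 sectors are assumptions of this sketch.
</p>

```shell
# Print a dm-cache table line. Arguments: origin size in sectors,
# metadata device, cache device, origin device, mode, policy.
# The cache block size of 512 sectors (256 KiB) is an arbitrary
# choice for this sketch. No policy arguments are passed.
dm_cache_table() {
	echo "0 $1 cache $2 $3 $4 512 1 $5 $6 0"
}

# Hypothetical usage (device names are placeholders, needs root):
#   dm_cache_table "$sectors" /dev/loop0 /dev/loop1 /dev/loop2 \
#           writeback smq | dmsetup create cached
# Decommissioning: load a table with the cleaner policy instead.
#   dmsetup suspend cached
#   dm_cache_table "$sectors" /dev/loop0 /dev/loop1 /dev/loop2 \
#           writeback cleaner | dmsetup reload cached
#   dmsetup resume cached
```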
 723
<p> Bcache is the clear winner of this comparison because it is stable,
configurable and, thanks to its separation of random and sequential
I/O, performs better at least on the server setup described above. The
only advantage of dm-cache is its flexibility because cache policies
can be switched. But even this remains a theoretical advantage as
long as only a single policy for dm-cache exists. </p>
 730
 731 EXERCISES()
 732
 733 <ul>
 734
 735         <li> Recall the concepts of writeback and writethrough and explain
 736         why writeback is faster and writethrough is safer. </li>
 737
 738         <li> Explain how the <em>writearound</em> mode of bcache works and
 739         when it should be used. </li>
 740
        <li> Set up a bcache device from two loop devices. </li>

        <li> Create a file system on a bcache device and mount it. Detach
        the cache device while the file system is mounted. </li>

        <li> Set up a dm-cache device from two loop devices. </li>

        <li> Set up a thin pool where the data LV is a dm-cache device.</li>
 749
 750         <li> Explain the point of dm-cache's <em>passthrough</em> mode.</li>
 751
 752 </ul>
 753
 754 HOMEWORK(«
 755
 756 Explain why small writes to a file system which is stored on a
 757 parity raid result in read-modify-write (RMW) updates. Explain why
 758 RMW updates are particularly expensive and how raid implementations
 759 and block layer caches try to avoid them.
 760
 761 »)
 762
 763 HOMEWORK(«
 764
 765 Recall the concepts of writeback and writethrough. Describe what
 766 each mode means for a hardware device and for a bcache/dm-cache
 767 device. Explain why writeback is faster and writethrough is safer.
 768
 769 »)
 770
 771 HOMEWORK(«
 772
 773 TRIM and UNMAP are special commands in the ATA/SCSI command sets
 774 which inform an SSD that certain data blocks are no longer in use,
 775 allowing the SSD to re-use these blocks to increase performance and
 776 to reduce wear. Subsequent reads from the trimmed data blocks will
not return any meaningful data. For example, the <code> mkfs </code>
command sends this command to discard all blocks of the device.
Discuss the implications when <code> mkfs </code> is run on a device
provided by bcache or dm-cache.
 781
 782 »)
 783
 784 SECTION(«The dm-crypt Target»)
 785
 786 <p> This device mapper target provides encryption of arbitrary block
 787 devices by employing the primitives of the crypto API of the Linux
 788 kernel. This API provides a uniform interface to a large number of
 789 cipher algorithms which have been implemented with performance and
 790 security in mind. </p>
 791
 792 <p> The cipher algorithm of choice for the encryption of block devices
 793 is the <em> Advanced Encryption Standard </em> (AES), also known
 794 as <em> Rijndael</em>, named after the two Belgian cryptographers
 795 Rijmen and Daemen who proposed the algorithm in 1999. AES is a <em>
 796 symmetric block cipher</em>. That is, a transformation which operates
 797 on fixed-length blocks and which is determined by a single key for both
 798 encryption and decryption. The underlying algorithm is fairly simple,
 799 which makes AES perform well in both hardware and software. Also
 800 the key setup time and the memory requirements are excellent. Modern
 801 processors of all manufacturers include instructions to perform AES
 802 operations in hardware, improving speed and security. </p>
 803
 804 <p> According to the Snowden documents, the NSA has been doing research
 805 on breaking AES for a long time without being able to come up with
 806 a practical attack for 256 bit keys. Successful attacks invariably
 807 target the key management software instead, which is often implemented
 808 poorly, trading security for user-friendliness, for example by
 809 storing passwords weakly encrypted, or by providing a "feature"
 810 which can decrypt the device without knowing the password. </p>
 811
<p> The exercises of this section ask the reader to encrypt a loop device
with AES without relying on any third party key management software. </p>
 814
 815 EXERCISES()
 816 <ul>
 817         <li> Discuss the message of this <a
 818         href="https://xkcd.com/538/">xkcd</a> comic. </li>
 819
 820         <li> How can a hardware implementation of an algorithm like AES
 821         improve security? After all, it is the same algorithm that is
 822         implemented. </li>
 823
 824         <li> What's the point of the <a href="#random_stream">rstream.c</a>
 825         program below which writes random data to stdout? Doesn't <code>
 826         cat /dev/urandom </code> do the same? </li>
 827
 828         <li> Compile and run <a href="#random_stream">rstream.c</a> to create
 829         a 10G local file and create the loop device <code> /dev/loop0 </code>
 830         from the file. </li>
 831
 832         <li> A <em> table </em> for the <code> dmsetup(8) </code> command is
 833         a single line of the form <code> start_sector num_sectors target_type
 834         target_args</code>. Determine the correct values for the first three
 835         arguments to encrypt <code> /dev/loop0</code>. </li>
 836
 837         <li> The <code>target_args</code> for the dm-crypt target are
 838         of the form <code>cipher key iv_offset device offset</code>. To
 839         encrypt <code>/dev/loop0</code> with AES-256, <code>cipher</code>
 840         is <code>aes</code>, <code>device</code> is <code>/dev/loop0</code>
 841         and both offsets are zero. Come up with an idea to create a 256 bit
 842         key from a passphrase. </li>
 843
 844         <li> The <code> create </code> subcommand of <code> dmsetup(8)
 845         </code> creates a device from the given table. Run a command of
 846         the form <code> echo "$table" | dmsetup create cryptdev </code>
 847         to create the encrypted device <code> /dev/mapper/cryptdev </code>
 848         from the loop device. </li>
 849
 850         <li> Create a file system on <code> /dev/mapper/cryptdev</code>,
 851         mount it and create the file <code> passphrase </code> containing
 852         the string "super-secret" on this file system. </li>
 853
        <li> Unmount the <code> cryptdev </code> device and run <code> dmsetup
        remove cryptdev</code>. Run <code> strings </code> on the loop device
        and on the underlying file to see if either contains the string <code>
        super-secret </code> or <code> passphrase</code>. </li>
 858
 859         <li> Re-create the <code> cryptdev </code> device, but this time use
 860         a different (hence invalid) key. Guess what happens and confirm. </li>
 861
 862         <li> Write a script which disables echoing (<code>stty -echo</code>),
 863         reads a passphrase from stdin and combines the above steps to create
 864         and mount an encrypted device. </li>
 865
 866 </ul>
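
<p> The table layout described in the exercises can be checked
with a small helper before anything is passed to <code> dmsetup(8)
</code>. The shell sketch below only assembles the line from the
fields named above and insists on a key of 64 hexadecimal digits (256
bits). The helper name <code>crypt_table</code> is ours, and how to
obtain the sector count and the key is deliberately left open. </p>

```shell
# Print a dm-crypt table line for AES with a 256 bit key.
# Arguments: number of sectors, key as 64 hex digits, device.
# Both offsets are zero, as stated in the exercises.
# Returns 1 if the key is not exactly 64 hexadecimal digits.
crypt_table() {
	case "$2" in
	*[!0-9a-fA-F]*) return 1 ;;
	esac
	[ "${#2}" -eq 64 ] || return 1
	echo "0 $1 crypt aes $2 0 $3 0"
}

# Hypothetical usage, with $sectors and $key set by the caller:
#   crypt_table "$sectors" "$key" /dev/loop0 | dmsetup create cryptdev
```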
 867
 868 HOMEWORK(«
 869
 870 Why is it a good idea to overwrite a block device with random data
 871 before it is encrypted?
 872
 873 »)
 874
 875 HOMEWORK(«
 876
 877 The dm-crypt target encrypts whole block devices. An alternative is
 878 to encrypt on the file system level. That is, each file is encrypted
 879 separately. Discuss the pros and cons of both approaches.
 880
 881 »)
 882
 883 SUPPLEMENTS()
 884
 885 SUBSECTION(«Random stream»)
 886
 887 <pre>
 888         <code>
 889                 /* Link with -lcrypto */
 890                 #include &lt;openssl/rand.h&gt;
 891                 #include &lt;stdio.h&gt;
 892                 #include &lt;unistd.h&gt;
                #include &lt;stdlib.h&gt; /* exit(), EXIT_FAILURE */
 894
 895                 int main(int argc, char **argv)
 896                 {
 897                         unsigned char buf[1024 * 1024];
 898
 899                         for (;;) {
 900                                 int ret = RAND_bytes(buf, sizeof(buf));
 901
 902                                 if (ret &lt;= 0) {
 903                                         fprintf(stderr, "RAND_bytes() error\n");
 904                                         exit(EXIT_FAILURE);
 905                                 }
 906                                 ret = write(STDOUT_FILENO, buf, sizeof(buf));
 907                                 if (ret &lt; 0) {
 908                                         perror("write");
 909                                         exit(EXIT_FAILURE);
 910                                 }
 911                         }
 912                         return 0;
 913                 }
 914         </code>
 915 </pre>