LVM.m4

   1 TITLE(«
   2
   3         Who the heck is General Failure, and why is he reading my disk? -- Unknown
   4
   5 », __file__)
   6
   7 OVERVIEW(«
   8
   9 The idea of Logical Volume Management is to decouple data and
  10 storage. This offers great flexibility in managing storage and reduces
  11 server downtimes because the storage may be replaced while file
  12 systems are mounted read-write and applications are actively using
  13 them. This chapter provides an introduction to the Linux block layer
  14 and LVM. Subsequent sections cover selected device mapper targets.
  15
  16 »)
  17
  18 SECTION(«The Linux Block Layer»)
  19
  20 <p> The main task of LVM is the management of block devices, so it is
  21 natural to start an introduction to LVM with a section on the Linux
  22 block layer, which is the central component in the Linux kernel
  23 for the handling of persistent storage devices. The mission of the
  24 block layer is to provide a uniform interface to different types
  25 of storage devices. The obvious in-kernel users of this interface
  26 are the file systems and the swap subsystem. But also <em> stacking
  27 device drivers </em> like LVM, Bcache and MD access block devices
  28 through this interface to create virtual block devices from other block
  29 devices. Some user space programs (<code>fdisk, dd, mkfs, ...</code>)
  30 also need to access block devices. The block layer allows them to
  31 perform their task in a well-defined and uniform manner through
  32 block-special device files. </p>
  33
  34 <p> The userspace programs and the in-kernel users interact with the block
  35 layer by sending read or write requests. A <em>bio</em> is the central
  36 data structure that carries such requests within the kernel. Bios
  37 may contain an arbitrary amount of data. They are given to the block
  38 layer to be queued for subsequent handling. Often a bio has to travel
  39 through a stack of block device drivers where each driver modifies
  40 the bio and sends it on to the next driver. Typically, only the last
  41 driver in the stack corresponds to a hardware device. </p>
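<p> The following Python sketch illustrates how a bio might travel
through such a stack. It is a toy model only: the class names, the
device names and the offsets are hypothetical, and the real kernel data
structures are considerably more involved. </p>

```python
# Toy model of a bio passing through a stack of block drivers.
# All names are made up for illustration; this is not kernel code.
from dataclasses import dataclass

@dataclass
class Bio:
    device: str   # name of the device the bio is currently aimed at
    sector: int   # start sector relative to that device
    count: int    # number of sectors

class LinearDriver:
    """Stacking driver: redirects a bio to a lower device at an offset."""
    def __init__(self, lower, offset):
        self.lower, self.offset = lower, offset

    def submit(self, bio, devices):
        # Modify the bio, then send it on to the next driver in the stack.
        remapped = Bio(self.lower, bio.sector + self.offset, bio.count)
        return devices[self.lower].submit(remapped, devices)

class DiskDriver:
    """Last driver in the stack: corresponds to the hardware device."""
    def submit(self, bio, devices):
        return bio

devices = {
    "lv": LinearDriver("partition", offset=2048),
    "partition": LinearDriver("disk", offset=4096),
    "disk": DiskDriver(),
}
final = devices["lv"].submit(Bio("lv", sector=10, count=8), devices)
print(final)
```

<p> In this model, the request for sector 10 of the topmost device ends
up at sector 10 + 2048 + 4096 of the disk. </p>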
  42
  43 <p> Besides requests to read or write data blocks, there are various other
  44 bio requests that carry SCSI commands like FLUSH, FUA (Force Unit
  45 Access), TRIM and UNMAP. FLUSH and FUA ensure that certain data hits
   46 stable storage. FLUSH asks the device to write out the contents of
  47 its volatile write cache while a FUA request carries data that should
  48 be written directly to the device, bypassing all caches. UNMAP/TRIM is
  49 a SCSI/ATA command which is only relevant to SSDs. It is a promise of
  50 the OS to not read the given range of blocks any more, so the device
  51 is free to discard the contents and return arbitrary data on the
  52 next read. This helps the device to level out the number of times
  53 the flash storage cells are overwritten (<em>wear-leveling</em>),
  54 which improves the durability of the device. </p>
  55
  56 <p> The first task of the block layer is to split incoming bios if
  57 necessary to make them conform to the size limit or the alignment
  58 requirements of the target device, and to batch and merge bios so that
   59 they can be submitted as a unit for performance reasons. The bios
   60 processed in this way then form an I/O request which is handed to an <em>
  61 I/O scheduler </em> (also known as <em> elevator</em>). </p>
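<p> Splitting and merging can be sketched in a few lines of Python.
This is a simplified model under stated assumptions: requests are plain
(start sector, sector count) pairs, the size limit is a made-up
parameter, and alignment requirements are ignored. </p>

```python
# Toy model: requests are (start_sector, sector_count) tuples.
def split(request, max_sectors):
    """Split one request into pieces of at most max_sectors sectors."""
    start, count = request
    pieces = []
    while count > 0:
        n = min(count, max_sectors)
        pieces.append((start, n))
        start, count = start + n, count - n
    return pieces

def merge(requests):
    """Merge requests that touch adjacent sector ranges."""
    merged = []
    for start, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_count = merged.pop()
            merged.append((prev_start, prev_count + count))
        else:
            merged.append((start, count))
    return merged

print(split((0, 20), max_sectors=8))
print(merge([(8, 8), (100, 4), (0, 8)]))
```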
  62
   63 <p> At the time of writing (2018-11) there exist two different sets
  64 of schedulers: the traditional single-queue schedulers and the
  65 modern multi-queue schedulers, which are expected to replace the
  66 single-queue schedulers soon. The three single-queue schedulers,
   67 noop, deadline and cfq (completely fair queueing), were designed for
  68 rotating disks. They reorder requests with the aim to minimize seek
  69 time. The newer multi-queue schedulers, mq-deadline, kyber, and bfq
  70 (budget fair queueing), aim to max out even the fastest devices. As
  71 implied by the name "multi-queue", they implement several request
  72 queues, the number of which depends on the hardware in use. This
  73 has become necessary because modern storage hardware allows multiple
  74 requests to be submitted in parallel from different CPUs. Moreover,
  75 with many CPUs the locking overhead required to put a request into
  76 a queue increases. Per-CPU queues allow for per-CPU locks, which
  77 decreases queue lock contention. </p>
  78
   79 <p> We will take a look at some aspects of the Linux block layer and at
  80 the various I/O schedulers. An exercise on loop devices enables the
  81 reader to create block devices for testing. This will be handy in
  82 the subsequent sections on LVM specific topics. </p>
  83
  84 EXERCISES()
  85
  86 <ul>
  87
  88         <li> Run <code>find /dev -type b</code> to get the list of all block
  89         devices on your system. Explain which is which. </li>
  90
  91         <li> Examine the files in <code>/sys/block/sda</code>, in
  92         particular <code>/sys/block/sda/stat</code>. Search the web for
  93         <code>Documentation/block/stat.txt</code> for the meaning of the
  94         numbers shown. Then run <code>iostat -xdh sda 1</code>. </li>
  95
  96         <li> Examine the files in <code>/sys/block/sda/queue</code>. </li>
  97
  98         <li> Find out how to determine the size of a block device. </li>
  99
  100         <li> Figure out a way to identify the names of all block devices which
 101         correspond to SSDs (i.e., excluding any rotating disks). </li>
 102
 103         <li> Run <code>lsblk</code> and discuss
 104         the output. Too easy? Run <code>lsblk -o
  105         KNAME,PHY-SEC,MIN-IO,OPT-IO,LOG-SEC,RQ-SIZE,ROTA,SCHED</code>
 106         </li>
 107
 108         <li> What's the difference between a task scheduler and an I/O
 109         scheduler? </li>
 110
 111         <li> Why are I/O schedulers also called elevators? </li>
 112
 113         <li> How can one find out which I/O schedulers are supported on a
 114         system and which scheduler is active for a given block device? </li>
 115
 116         <li> Is it possible (and safe) to change the I/O scheduler for a
 117         block device while it is in use? If so, how can this be done? </li>
 118
 119         <li> The loop device driver of the Linux kernel allows privileged
 120         users to create a block device from a regular file stored on a file
 121         system.  The resulting block device is called a <em>loop</em> device.
 122         Create a 1G large temporary file containing only zeroes. Run a suitable
 123         <code>losetup(8)</code> command to create a loop device from the
 124         file. Create an XFS file system on the loop device and mount it. </li>
 125
 126 </ul>
 127
 128 HOMEWORK(«
 129
 130 <ul>
 131         <li> Come up with three different use cases for loop devices. </li>
 132
 133         <li> Given a block device node in <code> /dev</code>, how can one
 134         tell that it is a loop device? </li>
 135
 136         <li> Describe the connection between loop devices created by
 137         <code>losetup(8)</code> and the loopback device used for network
 138         connections from the machine to itself. </li>
 139
 140 </ul>
 141 »)
 142
 143 define(«svg_disk», «
 144         <g
 145                 fill="$5"
 146                 stroke="black"
 147                 stroke-width="1"
 148         >
 149         <ellipse
 150                 cx="eval($1 + $3 / 2)"
 151                 cy="eval($2 + $4)"
 152                 rx="eval($3 / 2)"
 153                 ry="eval($3 / 4)"
 154         />
 155         <rect
 156                 x="$1"
 157                 y="$2"
 158                 width="$3"
 159                 height="$4"
 160         />
 161         <rect
 162                 x="eval($1 + 1)"
 163                 y="eval($2 + $4 - 1)"
 164                 width="eval($3 - 2)"
 165                 height="2"
 166                 stroke="$5"
 167         />
 168         <ellipse
 169                 cx="eval($1 + $3 / 2)"
 170                 cy="$2"
 171                 rx="eval($3 / 2)"
 172                 ry="eval($3 / 4)"
 173         />
 174         </g>
 175 »)
 176
 177 SECTION(«Physical and Logical Volumes, Volume Groups»)
 178
  179 <p> Getting started with the Logical Volume Manager (LVM) requires
  180 getting used to a minimal set of vocabulary. This section introduces
 181 the words named in the title of the section, and a couple more.
 182 The basic concepts of LVM are then described in terms of these words. </p>
 183
 184 <div>
  185 define(«lvm_width», «300»)
 186 define(«lvm_height», «183»)
 187 define(«lvm_margin», «10»)
 188 define(«lvm_extent_size», «10»)
 189 define(«lvm_extent», «
 190         <rect
 191                 fill="$1"
 192                 x="$2"
 193                 y="$3"
 194                 width="lvm_extent_size()"
 195                 height="lvm_extent_size()"
 196                 stroke="black"
 197                 stroke-width="1"
 198         />
 199 »)
 200 dnl $1: color, $2: x, $3: y, $4: number of extents
 201 define(«lvm_extents», «
 202         ifelse(«$4», «0», «», «
 203                 lvm_extent(«$1», «$2», «$3»)
 204                 lvm_extents(«$1», eval($2 + lvm_extent_size() + lvm_margin()),
 205                         «$3», eval($4 - 1))
 206         »)
 207 »)
 208 dnl $1: x, $2: y, $3: number of extents, $4: disk color, $5: extent color
 209 define(«lvm_disk», «
 210         ifelse(eval(«$3» > 3), «1», «
 211                 pushdef(«h», «eval(7 * lvm_extent_size())»)
 212                 pushdef(«w», «eval(($3 + 1) * lvm_extent_size())»)
 213         », «
 214                 pushdef(«h», «eval(3 * lvm_extent_size() + lvm_margin())»)
 215                 pushdef(«w», «eval($3 * lvm_extent_size() * 2)»)
 216         »)
 217         svg_disk(«$1», «$2», «w()», «h()», «$4»)
 218         ifelse(eval(«$3» > 3), «1», «
 219                 pushdef(«n1», eval(«$3» / 2))
 220                 pushdef(«n2», eval(«$3» - n1()))
 221                 lvm_extents(«$5»,
 222                         eval(«$1» + (w() - (2 * n1() - 1) * lvm_extent_size()) / 2),
 223                         eval(«$2» + h() / 2 - lvm_extent_size()), «n1()»)
 224                 lvm_extents(«$5»,
 225                         eval(«$1» + (w() - (2 * n2() - 1) * lvm_extent_size()) / 2),
 226                         eval(«$2» + h() / 2 + 2 * lvm_extent_size()), «n2()»)
 227                 popdef(«n1»)
 228                 popdef(«n2»)
 229         », «
 230                 lvm_extents(«$5»,
 231                         eval(«$1» + (w() - (2 * «$3» - 1) * lvm_extent_size()) / 2),
 232                         eval(«$2» + h() / 2), «$3»)
 233         »)
 234         popdef(«w»)
 235         popdef(«h»)
 236 »)
 237 <svg
 238         width="lvm_width()" height="lvm_height()"
 239         xmlns="http://www.w3.org/2000/svg"
 240         xmlns:xlink="http://www.w3.org/1999/xlink"
 241 >
 242         <rect
  243                 x="1"
  244                 y="1"
 245                 width="140"
 246                 height="180"
 247                 fill="green"
 248                 rx="10"
 249                 stroke-width="1"
 250                 stroke="black"
 251         />
 252         lvm_disk(«20», «20», «2», «#666», «yellow»)
 253         lvm_disk(«10», «90», «4», «#666», «yellow»)
 254         lvm_disk(«70», «55», «5», «#666», «yellow»)
 255         <path
 256                 d="
 257                         M 155 91
 258                         l 20 0
 259                         m 0 0
 260                         l -4 -3
 261                         l 0 6
 262                         l 4 -3
 263                         z
 264                 "
 265                 stroke-width="4"
 266                 stroke="black"
 267                 fill="black"
 268         />
 269         lvm_disk(«190», «22», «7», «#66f», «orange»)
 270         lvm_disk(«220», «130», «1», «#66f», «orange»)
 271 </svg>
 272 </div>
 273
 274 <p> A <em> Physical Volume</em> (PV, grey) is an arbitrary block device which
 275 contains a certain metadata header (also known as <em>superblock</em>)
  276 at the start. PVs can be partitions on a local hard disk or an SSD,
 277 a soft- or hardware raid, or a loop device. LVM does not care.
 278 The storage space on a physical volume is managed in units called <em>
 279 Physical Extents </em> (PEs, yellow). The default PE size is 4M. </p>
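<p> As a rough worked example, the number of PEs on a PV follows from
a simple division. The sketch below assumes a 5G device, as used in the
exercises of this section, and ignores the space taken by the metadata
header, so a real PV offers slightly fewer extents. </p>

```python
# Rough arithmetic only: the metadata header is ignored, so a real PV
# offers slightly fewer extents than computed here.
MIB = 1024 * 1024
GIB = 1024 * MIB

def extent_count(device_size, pe_size=4 * MIB):
    """Number of whole physical extents that fit on a device."""
    return device_size // pe_size

print(extent_count(5 * GIB))            # default PE size of 4M
print(extent_count(5 * GIB, 32 * MIB))  # a larger, non-default PE size
```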
 280
 281 <p> A <em>Volume Group</em> (VG, green) is a non-empty set of PVs with
 282 a name and a unique ID assigned to it. A PV can but doesn't need to
 283 be assigned to a VG. If it is, the ID of the associated VG is stored
 284 in the metadata header of the PV. </p>
 285
 286 <p> A <em> Logical Volume</em> (LV, blue) is a named block device which is
 287 provided by LVM. LVs are always associated with a VG and are stored
 288 on that VG's PVs. Since LVs are normal block devices, file systems
 289 of any type can be created on them, they can be used as swap storage,
 290 etc. The chunks of a LV are managed as <em>Logical Extents</em> (LEs,
 291 orange). Often the LE size equals the PE size.  For each LV there is
 292 a mapping between the LEs of the LV and the PEs of the underlying
  293 PVs. The PEs can span multiple PVs. </p>
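<p> The mapping between LEs and PEs can be pictured as a small table
of segments. The Python sketch below uses a hypothetical layout in which
a 3G LV (768 extents of 4M) is stored on two PVs; the segment boundaries
are made up for illustration and do not reflect how LVM stores its
metadata. </p>

```python
# Hypothetical layout of a 3G LV (768 extents of 4M) in a VG with two PVs.
# Each segment maps a run of logical extents to a run of physical extents:
# (first LE, number of extents, PV name, first PE)
SEGMENTS = [
    (0,   500, "/dev/loop1", 0),
    (500, 268, "/dev/loop2", 100),
]

def lookup(le):
    """Translate a logical extent number into (PV, physical extent)."""
    for first_le, count, pv, first_pe in SEGMENTS:
        if first_le <= le < first_le + count:
            return pv, first_pe + (le - first_le)
    raise ValueError(f"LE {le} is outside the LV")

print(lookup(0))
print(lookup(499))
print(lookup(500))
```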
 294
  295 <p> A VG can be extended by adding additional PVs to it, or reduced by
 296 removing unused devices, i.e., those with no PEs allocated on them. PEs
 297 may be moved from one PV to another while the LVs are active. LVs
 298 may be grown or shrunk. To grow a LV, there must be enough space
 299 left in the VG. Growing a LV does not magically grow the file system
 300 stored on it, however. To make use of the additional space, a second,
  301 file system specific step is needed to tell the file system that its
 302 underlying block device (the LV) has grown. </p>
 303
 304 <p> The exercises of this section illustrate the basic LVM concepts
 305 and the essential LVM commands. They ask the reader to create a VG
 306 whose PVs are loop devices. This VG is used as a starting point in
 307 subsequent chapters. </p>
 308
 309 EXERCISES()
 310
 311 <ul>
 312
 313         <li> Create two 5G large loop devices <code>/dev/loop1</code>
 314         and <code>/dev/loop2</code>. Make them PVs by running
 315         <code>pvcreate</code>. Create a VG <code>tvg</code> (test volume group)
 316         from the two loop devices and two 3G large LVs named <code>tlv1</code>
 317         and <code>tlv2</code> on it. Run the <code>pvcreate, vgcreate</code>,
 318         and <code>lvcreate</code> commands with <code>-v</code> to activate
 319         verbose output and try to understand each output line. </li>
 320
 321         <li> Run <code>pvs, vgs, lvs, lvdisplay, pvdisplay</code> and examine
 322         the output. </li>
 323
 324         <li> Run <code>lvdisplay -m</code> to examine the mapping of logical
 325         extents to PVs and physical extents. </li>
 326
 327         <li> Run <code>pvs --segments -o+lv_name,seg_start_pe,segtype</code>
 328         to see the map between physical extents and logical extents. </li>
 329
 330 </ul>
 331
 332 HOMEWORK(«
 333
 334 In the above scenario (two LVs in a VG consisting of two PVs), how
 335 can you tell whether both PVs are actually used? Remove the LVs
 336 with <code>lvremove</code>. Recreate them, but this time use the
 337 <code>--stripes 2</code> option to <code>lvcreate</code>. Explain
 338 what this option does and confirm with a suitable command.
 339
 340 »)
 341
 342 SECTION(«Device Mapper and Device Mapper Targets»)
 343
 344 <p> The kernel part of the Logical Volume Manager (LVM) is called
 345 <em>device mapper</em> (DM), which is a generic framework to map
 346 one block device to another. Applications talk to the Device Mapper
 347 via the <em>libdevmapper</em> library, which issues requests
 348 to the <code>/dev/mapper/control</code> character device using the
 349 <code>ioctl(2)</code> system call. The device mapper is also accessible
 350 from scripts via the <code>dmsetup(8)</code> tool. </p>
 351
 352 <p> A DM target represents one particular mapping type for ranges
  353 of LEs. Several DM targets exist, each of which creates and
 354 maintains block devices with certain characteristics. In this section
 355 we take a look at the <code>dmsetup</code> tool and the relatively
 356 simple <em>mirror</em> target. Subsequent sections cover other targets
 357 in more detail. </p>
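<p> Each DM device is described by a table whose lines consist of a
start sector, a length in sectors, the name of the target and
target-specific parameters, where a sector is 512 bytes. The Python
sketch below parses one such line. The sample line is made up; it
describes a 3G device which the linear target maps onto the device with
major and minor number 7:1, starting at sector 2048. </p>

```python
# Parse one line of a device mapper table. The sample line is made up;
# it describes a 3G device mapped linearly onto device 7:1.
SECTOR_SIZE = 512

def parse_table_line(line):
    start, length, target, *args = line.split()
    return {
        "start": int(start),     # first sector of the mapped range
        "length": int(length),   # number of sectors in the range
        "target": target,        # name of the DM target
        "args": args,            # target-specific parameters
    }

entry = parse_table_line("0 6291456 linear 7:1 2048")
size_gib = entry["length"] * SECTOR_SIZE / 2**30
print(entry["target"], entry["args"], size_gib)
```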
 358
 359 EXERCISES()
 360
 361 <ul>
 362
 363         <li> Run <code>dmsetup targets</code> to list all targets supported
 364         by the currently running kernel. Explain their purpose and typical
 365         use cases. </li>
 366
 367         <li> Starting with the <code>tvg</code> VG, remove <code>tlv2</code>.
 368         Convince yourself by running <code>vgs</code> that <code>tvg</code>
 369         is 10G large, with 3G being in use. Run <code>pvmove
 370         /dev/loop1</code> to move the used PEs of <code>/dev/loop1</code>
 371         to <code>/dev/loop2</code>. After the command completes, run
 372         <code>pvs</code> again to see that <code>/dev/loop1</code> has no
 373         more PEs in use. </li>
 374
 375         <li> Create a third 5G loop device <code>/dev/loop3</code>, make it a
  376         PV and extend the VG with <code>vgextend tvg /dev/loop3</code>. Now
  377         the LEs of the remaining LV, <code>tlv1</code>, fit on any
 378         of the three PVs.  Come up with a command which moves them to
 379         <code>/dev/loop3</code>. </li>
 380
 381         <li> The first two loop devices are both unused. Remove them from
 382         the VG with <code>vgreduce -a</code>. Why are they still listed in
 383         the <code>pvs</code> output? What can be done about that? </li>
 384
 385 </ul>
 386
 387 HOMEWORK(«
 388
 389 As advertised in the introduction, LVM allows the administrator to
 390 replace the underlying storage of a file system online. This is done
 391 by running a suitable <code>pvmove(8)</code> command to move all PEs of
 392 one PV to different PVs in the same VG.
 393
 394 <ul>
 395
 396         <li> Explain the mapping type of dm-mirror. </li>
 397
 398         <li> The traditional way to mirror the contents of two or more block
 399         devices is software raid 1, also known as <em>md raid1</em> ("md"
  400         is short for multiple devices). Explain the difference between md raid1,
 401         the dm-raid target which supports raid1 and other raid levels, and
 402         the dm-mirror target. </li>
 403
 404         <li> Guess how <code>pvmove</code> is implemented on top of
 405         dm-mirror. Verify your guess by reading the "NOTES" section of the
 406         <code>pvmove(8)</code> man page. </li>
 407
 408 </ul>
 409 »)
 410
 411 SECTION(«LVM Snapshots»)
 412
 413 <p> LVM snapshots are based on the CoW optimization
 414 strategy described earlier in the chapter on <a
 415 href="./Unix_Concepts.html#the_virtual_address_space_of_a_unix_process">Unix
  416 Concepts</a>. Creating a snapshot means creating a CoW table of
 417 the given size. Just before a LE of a snapshotted LV is about to be
 418 written to, its contents are copied to a free slot in the CoW
 419 table. This preserves an old version of the LV, the snapshot, which
  420 can later be reconstructed by overlaying the CoW table atop the LV. </p>
 421
 422 <p> Snapshots can be taken from a LV which contains a mounted file system,
 423 while applications are actively modifying files. Without coordination
 424 between the file system and LVM, the file system most likely has memory
 425 buffers scheduled for writeback. These outstanding writes did not make
  426 it to the snapshot, so one cannot expect the snapshot to contain a
  427 consistent file system image. Instead, it is in a state similar to a
 428 regular device after an unclean shutdown. This is not a problem for
 429 XFS and EXT4, as both are <em>journalling</em> file systems, which
 430 were designed with crash recovery in mind. At the next mount after a
 431 crash, journalling file systems replay their journal, which results
 432 in a consistent state. Note that this implies that even a read-only
 433 mount of the snapshot device has to write to the device. </p>
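<p> The copy-before-write scheme described at the beginning of this
section can be modelled in a few lines of Python. This is a sketch
under simplifying assumptions: an LV is a list of extent contents, the
CoW table is a dictionary of unlimited size, and all names are
hypothetical. </p>

```python
# Toy model of a CoW snapshot; an LV is a list of extent contents.
class SnapshottedLV:
    def __init__(self, extents):
        self.origin = list(extents)
        self.cow = {}   # LE number -> preserved old content

    def write(self, le, data):
        # Copy the old content aside before the first overwrite of an LE.
        if le not in self.cow:
            self.cow[le] = self.origin[le]
        self.origin[le] = data

    def read_snapshot(self, le):
        # Overlay the CoW table atop the origin.
        return self.cow.get(le, self.origin[le])

lv = SnapshottedLV(["a", "b", "c"])
lv.write(1, "B")
print(lv.origin)
print([lv.read_snapshot(i) for i in range(3)])
```

<p> After the write, the origin contains the new data while the overlay
still yields the contents at the time the snapshot was taken. </p>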
 434
 435 EXERCISES()
 436
 437 <ul>
 438
 439         <li> In the test VG, create a 1G large snapshot named
  440         <code>snap_tlv1</code> of the <code>tlv1</code> LV by using the
 441         <code>-s</code> option to <code>lvcreate(8)</code>. Predict how much
 442         free space is left in the VG. Confirm with <code>vgs tvg</code>. </li>
 443
 444         <li> Create an EXT4 file system on <code>tlv1</code> by running
  445         <code>mkfs.ext4 /dev/tvg/tlv1</code>. Guess how much of the snapshot
  446         space has been allocated by this operation. Check with <code>lvs
  447         tvg/snap_tlv1</code>. </li>
 448
 449         <li> Remove the snapshot with <code>lvremove</code> and recreate
 450         it. Repeat the previous step, but this time run <code>mkfs.xfs</code>
  451         to create an XFS file system. Run <code>lvs tvg/snap_tlv1</code>
 452         again and compare the used snapshot space to the EXT4 case. Explain
 453         the difference. </li>
 454
 455         <li> Remove the snapshot and recreate it so that both <code>tlv1</code>
 456         and <code>snap_tlv1</code> contain a valid XFS file system. Mount
 457         the file systems on <code>/mnt/1</code> and <code>/mnt/2</code>. </li>
 458
 459         <li> Run <code>dd if=/dev/zero of=/mnt/1/zero count=$((2 * 100 *
 460         1024))</code> to create a 100M large file on <code>tlv1</code>. Check
 461         that <code>/mnt/2</code> is still empty. Estimate how much of the
 462         snapshot space is used and check again. </li>
 463
 464         <li> Repeat the above <code>dd</code> command 5 times and run
 465         <code>lvs</code> again. Explain why the used snapshot space did not
 466         increase. </li>
 467
 468         <li> It is possible to create snapshots of snapshots. This is
 469         implemented by chaining together CoW tables. Describe the impact on
 470         performance. </li>
 471
 472         <li> Suppose a snapshot was created before significant modifications
 473         were made to the contents of the LV, for example an upgrade of a large
 474         software package. Assume that the user wishes to permanently return to
 475         the old version because the upgrade did not work out. In this scenario
 476         it is the snapshot which needs to be retained, rather than the original
 477         LV.  In view of this scenario, guess what happens on the attempt to
 478         remove a LV which is being snapshotted. Unmount <code>/mnt/1</code>
  479         and confirm by running <code>lvremove tvg/tlv1</code>. </li>
 480
 481         <li> Come up with a suitable <code>lvconvert</code> command which
 482         replaces the role of the LV and its snapshot. Explain why this solves
 483         the "bad upgrade" problem outlined above. </li>
 484
 485         <li> Explain what happens if the CoW table fills up. Confirm by
 486         writing a file larger than the snapshot size. </li>
 487
 488 </ul>
 489
 490 SECTION(«Thin Provisioning»)
 491
 492 <p> The term "thin provisioning" is just a modern buzzword for
 493 over-subscription. Both terms mean to give the appearance of having
 494 more resources than are actually available. This is achieved by
  495 on-demand allocation. On Linux, thin provisioning
  496 is implemented as a DM target called <em>dm-thin</em>. This code
  497 first made its appearance in 2011 and was declared stable two
 498 years later. These days it should be safe for production use. </p>
 499
 500 <p> The general problem with thin provisioning is of course that bad
 501 things happen when the resources are exhausted because the demand has
 502 increased before new resources were added. For dm-thin this can happen
 503 when users write to their allotted space, causing dm-thin to attempt
 504 allocating a data block from a volume which is already full. This
 505 usually leads to severe data corruption because file systems are
 506 not really prepared to handle this error case and treat it as if the
 507 underlying block device had failed. dm-thin does nothing to prevent
 508 this, but one can configure a <em>low watermark</em>.  When the
 509 number of free data blocks drops below the watermark, a so-called
  510 <em>dm-event</em> will be generated to notify the administrator. </p>
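<p> On-demand allocation and the low watermark can be illustrated with
a toy model in Python. The class names, the block counts and the event
text are assumptions made for this sketch; it does not describe the
actual dm-thin interface. </p>

```python
# Toy model of a thin pool; names and numbers are made up.
class ThinPool:
    def __init__(self, data_blocks, low_watermark):
        self.free = data_blocks
        self.low_watermark = low_watermark
        self.events = []

    def allocate(self):
        if self.free == 0:
            raise IOError("thin pool is out of data blocks")
        self.free -= 1
        # Report once, at the moment the free count drops below the mark.
        if self.free == self.low_watermark - 1:
            self.events.append(f"low watermark: {self.free} blocks left")

class ThinLV:
    def __init__(self, pool, virtual_blocks):
        self.pool, self.virtual_blocks = pool, virtual_blocks
        self.mapped = set()   # virtual blocks backed by a pool block

    def write(self, block):
        if block not in self.mapped:   # allocate on first write only
            self.pool.allocate()
            self.mapped.add(block)

pool = ThinPool(data_blocks=4, low_watermark=2)
lv = ThinLV(pool, virtual_blocks=10)   # over-subscribed: 10 > 4
for block in (0, 1, 1, 2):
    lv.write(block)
print(pool.free, pool.events)
```

<p> Writing to two further unmapped blocks exhausts the pool in this
model, at which point the write fails with an error. </p>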
 511
 512 <p> One highlight of dm-thin is its efficient support for an arbitrary
 513 depth of recursive snapshots, called <em>dm-thin snapshots</em>
 514 in this document. With the traditional snapshot implementation,
 515 recursive snapshots quickly become a performance issue as the depth
 516 increases. With dm-thin one can have an arbitrary subset of all
 517 snapshots active at any point in time, and there is no ordering
 518 requirement on activating or removing them. </p>
 519
 520 <p> The block devices created by dm-thin always belong to a <em>thin
 521 pool</em> which ties together two LVs called the <em>metadata LV</em>
 522 and the <em>data LV</em>. The combined LV is called the <em>thin pool
 523 LV</em>. Setting up a VG for thin provisioning is done in two steps:
  524 First the standard LVs for data and the metadata are created. Second,
 525 the two LVs are combined into a thin pool LV. The second step hides
 526 the two underlying LVs so that only the combined thin pool LV is
 527 visible afterwards. Thin provisioned LVs and dm-thin snapshots can
 528 then be created from the thin pool LV with a single command. </p>
 529
  530 <p> Another nice feature of dm-thin is its support for <em>external snapshots</em>.
 531 An external snapshot is one where the origin for a thinly provisioned
 532 device is not a device of the pool. Arbitrary read-only block
 533 devices can be turned into writable devices by creating an external
 534 snapshot. Reads to an unprovisioned area of the snapshot will be passed
 535 through to the origin. Writes trigger the allocation of new blocks as
 536 usual with CoW. One use case for this is VM hosts which run their VMs
 537 on thinly-provisioned volumes but have the base image on some "master"
 538 device which is read-only and can hence be shared between all VMs. </p>
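<p> The read and write paths of an external snapshot can be sketched
as follows. This Python model is hypothetical: the read-only origin is
a tuple of block contents, and each VM gets its own snapshot object on
top of the shared base image. </p>

```python
# Toy model of an external snapshot over a read-only origin.
class ExternalSnapshot:
    def __init__(self, origin):
        self.origin = origin      # read-only, may be shared
        self.provisioned = {}     # block number -> privately written data

    def read(self, block):
        # Unprovisioned areas are passed through to the origin.
        return self.provisioned.get(block, self.origin[block])

    def write(self, block, data):
        # Writes allocate a block in the snapshot; the origin is untouched.
        self.provisioned[block] = data

base_image = ("boot", "root", "home")      # tuple: cannot be modified
vm1, vm2 = ExternalSnapshot(base_image), ExternalSnapshot(base_image)
vm1.write(2, "home of vm1")
print(vm1.read(2), "|", vm2.read(2), "|", vm1.read(0))
```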
 539
 540 EXERCISES()
 541
 542 <p> Starting with the <code>tvg</code> VG, create and test a thin pool LV
 543 by performing the following steps.  The "Thin Usage" section of
  544 <code>lvmthin(7)</code> will be helpful. </p>
 545
 546 <ul>
 547
 548         <li> Remove the <code>tlv1</code> and <code>tlv2</code> LVs. </li>
 549
 550         <li> Create a 5G data LV named <code>tdlv</code> (thin data LV)
  551         and a 500M LV named <code>tmdlv</code> (thin metadata LV). </li>
 552
 553         <li> Combine the two LVs into a thin pool with
 554         <code>lvconvert</code>. Run <code>lvs -a</code> and explain the flags
 555         listed below <code>Attr</code>. </li>
 556
 557         <li> Create a 10G thin LV named <code>oslv</code> (over-subscribed
 558         LV). </li>
 559
 560         <li> Create an XFS file system on <code>oslv</code> and mount it on
 561         <code>/mnt</code>. </li>
 562
  563         <li> Run a loop of the form <code>for ((i = 0; i &lt; 50; i++)); do
 564         ... ; done</code> so that each iteration creates a 50M file named
 565         <code>file-$i</code> and a snapshot named <code>snap_oslv-$i</code>
 566         of <code>oslv</code>. </li>
 567
 568         <li> Activate an arbitrary snapshot with <code>lvchange -K</code> and
 569         try to mount it. Explain what the error message means. Then read the
 570         "XFS on snapshots" section of <code>lvmthin(7)</code>. </li>
 571
 572         <li> Check the available space of the data LV with <code>lvs
 573         -a</code>. Mount one snapshot (specifying <code>-o nouuid</code>)
 574         and run <code>lvs -a</code> again.  Why did the free space decrease
 575         although no new files were written? </li>
 576
 577         <li> Mount four different snapshots and check that they contain the
 578         expected files. </li>
 579
  580         <li> Remove all snapshots. Guess what <code>lvs -a</code> and <code>df
  581         -h /mnt</code> report. Then run the commands to confirm. Guess
  582         what happens if you try to create another 3G file. Confirm
 583         your guess, then read the section on "Data space exhaustion" of
 584         <code>lvmthin(7)</code>. </li>
 585
 586 </ul>
 587
 588 HOMEWORK(«
 589
 590 When a thin pool provisions a new data block for a thin LV, the new
 591 block is first overwritten with zeros by default. Discuss why this
 592 is done, its impact on performance and security, and conclude whether
 593 or not it is a good idea to turn off the zeroing.
 594
 595 »)
 596
 597 SECTION(«Bcache, dm-cache and dm-writecache»)
 598
  599 <p> All three implementations named in the title of this section are <em>
 600 Linux block layer caches</em>. They combine two different block
 601 devices to form a hybrid block device which dynamically caches
 602 and migrates data between the two devices with the aim to improve
 603 performance. One device, the <em> backing device</em>, is expected
 604 to be large and slow while the other one, the <em>cache device</em>,
 605 is expected to be small and fast. </p>
 606
 607 <div>
 608 define(«bch_width», «300»)
 609 define(«bch_height», «130»)
 610 define(«bch_margin», «10»)
 611 define(«bch_rraid_width», «eval((bch_width() - 4 * bch_margin()) * 4 / 5)»)
 612 define(«bch_raidbox_height», «eval(bch_height() - 2 * bch_margin())»)
 613 define(«bch_nraid_width», «eval(bch_rraid_width() / 4)»)
 614 define(«bch_rdisk_width», «eval((bch_width() - 3 * bch_margin()) * 18 / 100)»)
 615 define(«bch_rdisk_height», «eval((bch_height() - 4 * bch_margin()) / 3)»)
 616 define(«bch_ndisk_width», «eval(bch_rdisk_width() / 2)»)
 617 define(«bch_ndisk_height», «eval(bch_raidbox_height() - 5 * bch_margin())»)
 618 define(«bch_rdisk», «svg_disk(«$1», «$2»,
 619         «bch_rdisk_width()», «bch_rdisk_height()», «#666»)»)
 620 define(«bch_ndisk», «svg_disk(«$1», «$2»,
 621         «bch_ndisk_width()», «bch_ndisk_height()», «#66f»)»)
 622 define(«bch_5rdisk», «
 623         bch_rdisk(«$1», «$2»)
 624         bch_rdisk(«eval($1 + bch_margin())»,
 625                 «eval($2 + bch_margin())»)
 626         bch_rdisk(«eval($1 + 2 * bch_margin())»,
 627                 «eval($2 + 2 * bch_margin())»)
 628         bch_rdisk(«eval($1 + 3 * bch_margin())»,
 629                 «eval($2 + 3 * bch_margin())»)
 630         bch_rdisk(«eval($1 + 4 * bch_margin())»,
 631                 «eval($2 + 4 * bch_margin())»)
 632
 633 »)
 634 define(«bch_rraid», «
 635         <rect
 636                 fill="#3b3"
 637                 stroke="black"
 638                 x="$1"
 639                 y="$2"
 640                 width="bch_rraid_width()"
 641                 height="bch_raidbox_height()"
  642                 rx="10"
 643         />
 644         bch_5rdisk(«eval($1 + bch_margin())»,
 645                 «eval($2 + 2 * bch_margin())»)
 646         bch_5rdisk(«eval($1 + 2 * bch_rdisk_width() + bch_margin())»,
 647                 «eval($2 + 2 * bch_margin())»)
 648 »)
 649 define(«bch_nraid», «
 650         <rect
 651                 fill="orange"
 652                 stroke="black"
 653                 x="$1"
 654                 y="$2"
 655                 width="bch_nraid_width()"
 656                 height="bch_raidbox_height()"
  657                 rx="10"
 658         />
 659         bch_ndisk(eval($1 + bch_margin()),
 660                 eval($2 + 2 * bch_margin()))
 661         bch_ndisk(eval($1 + 2 * bch_margin()),
 662                 eval($2 + 3 * bch_margin()))
 663 »)
 664
 665 <svg
 666         width="bch_width()" height="bch_height()"
 667         xmlns="http://www.w3.org/2000/svg"
 668         xmlns:xlink="http://www.w3.org/1999/xlink"
 669 >
 670         <rect
 671                 fill="#cc2"
 672                 stroke="black"
 673                 stroke-width="1"
 674                 x="1"
 675                 y="1"
 676                 width="eval(bch_rraid_width() + bch_nraid_width()
 677                         + 3 * bch_margin() - 2)"
 678                 height="eval(bch_raidbox_height() + 2 * bch_margin() - 2)"
 679                 rx="10"
 680         />
 681         bch_nraid(«bch_margin()», «bch_margin()»)
 682         bch_rraid(«eval(2 * bch_margin() + bch_nraid_width())», «bch_margin()»)
 683 </svg>
 684 </div>
 685
<p> The simplest setup consists of a single rotating disk and one SSD.
The setup shown in the diagram at the left is realistic for a large
server with redundant storage. In this setup the hybrid device
(yellow) combines a raid6 array (green) consisting of many rotating
disks (grey) with a two-disk raid1 array (orange) stored on fast
NVMe devices (blue). In the simple setup it is always a win when
I/O is performed from/to the SSD instead of the rotating disk. In
the server setup, however, it depends on the workload which device
is faster. Given enough rotating disks and a streaming I/O workload,
the raid6 outperforms the raid1 because all disks can read or write
at full speed in parallel. </p>
 697
 698 <p> Since block layer caches hook into the Linux block API described <a
 699 href="«#»the_linux_block_layer">earlier</a>, the hybrid block devices
 700 they provide can be used like any other block device. In particular,
 701 the hybrid devices are <em> file system agnostic</em>, meaning that
 702 any file system can be created on them. In what follows we briefly
 703 describe the differences between the three block layer caches and
 704 conclude with the pros and cons of each. </p>
 705
 706 <p> Bcache is a stand-alone stacking device driver which was
 707 included in the Linux kernel in 2013. According to the <a
 708 href="https://bcache.evilpiepirate.org/">bcache home page</a>, it
 709 is "done and stable". dm-cache and dm-writecache are device mapper
 710 targets included in 2013 and 2018, respectively, which are both marked
 711 as experimental. In contrast to dm-cache, dm-writecache only caches
 712 writes while reads are supposed to be cached in RAM. It has been
 713 designed for programs like databases which need low commit latency.
 714 Both bcache and dm-cache can operate in writeback or writethrough
 715 mode while dm-writecache always operates in writeback mode. </p>
 716
 717 <p> The DM-based caches are designed to leave the decision as to what
 718 data to migrate (and when) to user space while bcache has this policy
 719 built-in. However, at this point only the <em> Stochastic Multiqueue
 720 </em> (smq) policy for dm-cache exists, plus a second policy which
 721 is only useful for decommissioning the cache device. There are no
 722 tunables for dm-cache while all the bells and whistles of bcache can
 723 be configured through sysfs files.  Another difference is that bcache
 724 detects sequential I/O and separates it from random I/O so that large
 725 streaming reads and writes bypass the cache and don't push cached
 726 randomly accessed data out of the cache. </p>
 727
<p> Bcache is the clear winner of this comparison because it is stable,
configurable and performs better, at least on the server setup
described above, because it separates random and sequential I/O. The
only advantage of dm-cache is its flexibility because cache policies
can be switched. But even this remains a theoretical advantage as
long as only a single policy for dm-cache exists. </p>
 734
 735 EXERCISES()
 736
 737 <ul>
 738
 739         <li> Recall the concepts of writeback and writethrough and explain
 740         why writeback is faster and writethrough is safer. </li>
 741
 742         <li> Explain how the <em>writearound</em> mode of bcache works and
 743         when it should be used. </li>
 744
        <li> Set up a bcache device from two loop devices. </li>

        <li> Create a file system on a bcache device and mount it. Detach
        the cache device while the file system is mounted. </li>

        <li> Set up a dm-cache device from two loop devices. </li>

        <li> Set up a thin pool where the data LV is a dm-cache device.</li>
 753
 754         <li> Explain the point of dm-cache's <em>passthrough</em> mode.</li>
 755
 756 </ul>
 757
 758 HOMEWORK(«
 759
 760 Explain why small writes to a file system which is stored on a
 761 parity raid result in read-modify-write (RMW) updates. Explain why
 762 RMW updates are particularly expensive and how raid implementations
 763 and block layer caches try to avoid them.
 764
 765 »)
 766
 767 HOMEWORK(«
 768
 769 Recall the concepts of writeback and writethrough. Describe what
 770 each mode means for a hardware device and for a bcache/dm-cache
 771 device. Explain why writeback is faster and writethrough is safer.
 772
 773 »)
 774
 775 HOMEWORK(«
 776
 777 TRIM and UNMAP are special commands in the ATA/SCSI command sets
 778 which inform an SSD that certain data blocks are no longer in use,
 779 allowing the SSD to re-use these blocks to increase performance and
 780 to reduce wear. Subsequent reads from the trimmed data blocks will
not return any meaningful data. For example, the <code> mkfs </code>
command sends this command to discard all blocks of the device.
Discuss the implications when <code> mkfs </code> is run on a device
provided by bcache or dm-cache.
 785
 786 »)
 787
 788 SECTION(«The dm-crypt Target»)
 789
 790 <p> This device mapper target provides encryption of arbitrary block
 791 devices by employing the primitives of the crypto API of the Linux
 792 kernel. This API provides a uniform interface to a large number of
 793 cipher algorithms which have been implemented with performance and
 794 security in mind. </p>
 795
<p> The cipher algorithm of choice for the encryption of block devices
is the <em> Advanced Encryption Standard </em> (AES), also known
as <em> Rijndael</em>, named after the two Belgian cryptographers
Rijmen and Daemen who proposed the algorithm in 1998. AES is a <em>
symmetric block cipher</em>, that is, a transformation which operates
on fixed-length blocks and which is determined by a single key for both
encryption and decryption. The underlying algorithm is fairly simple,
which makes AES perform well in both hardware and software. The key
setup time and the memory requirements are also low. Modern
processors from all major manufacturers include instructions to perform
AES operations in hardware, improving speed and security. </p>
 807
 808 <p> According to the Snowden documents, the NSA has been doing research
 809 on breaking AES for a long time without being able to come up with
 810 a practical attack for 256 bit keys. Successful attacks invariably
 811 target the key management software instead, which is often implemented
 812 poorly, trading security for user-friendliness, for example by
 813 storing passwords weakly encrypted, or by providing a "feature"
 814 which can decrypt the device without knowing the password. </p>
 815
<p> The exercises of this section ask the reader to encrypt a loop device
with AES without relying on any third party key management software. </p>
 818
 819 EXERCISES()
 820 <ul>
 821         <li> Discuss the message of this <a
 822         href="https://xkcd.com/538/">xkcd</a> comic. </li>
 823
 824         <li> How can a hardware implementation of an algorithm like AES
 825         improve security? After all, it is the same algorithm that is
 826         implemented. </li>
 827
 828         <li> What's the point of the <a href="#random_stream">rstream.c</a>
 829         program below which writes random data to stdout? Doesn't <code>
 830         cat /dev/urandom </code> do the same? </li>
 831
 832         <li> Compile and run <a href="#random_stream">rstream.c</a> to create
 833         a 10G local file and create the loop device <code> /dev/loop0 </code>
 834         from the file. </li>
 835
 836         <li> A <em> table </em> for the <code> dmsetup(8) </code> command is
 837         a single line of the form <code> start_sector num_sectors target_type
 838         target_args</code>. Determine the correct values for the first three
 839         arguments to encrypt <code> /dev/loop0</code>. </li>
 840
 841         <li> The <code> target_args </code> for the dm-crypt target are
 842         of the form <code> cipher key iv_offset device offset</code>. To
 843         encrypt <code> /dev/loop0 </code> with AES-256, <code> cipher </code>
 844         is <code> aes</code>, device is <code> /dev/loop0 </code> and both
 845         offsets are zero. Come up with an idea to create a 256 bit key from
 846         a passphrase. </li>
 847
 848         <li> The <code> create </code> subcommand of <code> dmsetup(8)
 849         </code> creates a device from the given table. Run a command of
 850         the form <code> echo "$table" | dmsetup create cryptdev </code>
 851         to create the encrypted device <code> /dev/mapper/cryptdev </code>
 852         from the loop device. </li>
 853
 854         <li> Create a file system on <code> /dev/mapper/cryptdev</code>,
 855         mount it and create the file <code> passphrase </code> containing
 856         the string "super-secret" on this file system. </li>
 857
        <li> Unmount the <code> cryptdev </code> device and run <code> dmsetup
        remove cryptdev</code>. Run <code> strings </code> on the loop device
        and on the underlying file to see if they contain the string <code>
        super-secret </code> or <code> passphrase</code>. </li>
 862
 863         <li> Re-create the <code> cryptdev </code> device, but this time use
 864         a different (hence invalid) key. Guess what happens and confirm. </li>
 865
 866         <li> Write a script which disables echoing (<code>stty -echo</code>),
 867         reads a passphrase from stdin and combines the above steps to create
 868         and mount an encrypted device. </li>
 869
 870 </ul>
 871
 872 HOMEWORK(«
 873
 874 Why is it a good idea to overwrite a block device with random data
 875 before it is encrypted?
 876
 877 »)
 878
 879 HOMEWORK(«
 880
 881 The dm-crypt target encrypts whole block devices. An alternative is
 882 to encrypt on the file system level. That is, each file is encrypted
 883 separately. Discuss the pros and cons of both approaches.
 884
 885 »)
 886
 887 SUPPLEMENTS()
 888
 889 SUBSECTION(«Random stream»)
 890
<pre>
        <code>
                /* Link with -lcrypto */
                #include &lt;openssl/rand.h&gt;
                #include &lt;stdio.h&gt;
                #include &lt;stdlib.h&gt;
                #include &lt;unistd.h&gt;

                int main(int argc, char **argv)
                {
                        unsigned char buf[1024 * 1024];

                        /* Write random data until stdout fails or we get killed. */
                        for (;;) {
                                /* RAND_bytes() returns 1 on success. */
                                int ret = RAND_bytes(buf, sizeof(buf));

                                if (ret &lt;= 0) {
                                        fprintf(stderr, "RAND_bytes() error\n");
                                        exit(EXIT_FAILURE);
                                }
                                /*
                                 * A short write is harmless here: the
                                 * unwritten random bytes are simply dropped.
                                 */
                                ret = write(STDOUT_FILENO, buf, sizeof(buf));
                                if (ret &lt; 0) {
                                        perror("write");
                                        exit(EXIT_FAILURE);
                                }
                        }
                        return 0;
                }
        </code>
</pre>