LVM.m4

   1 TITLE(«
   2
   3         Who the heck is General Failure, and why is he reading my disk? -- Unknown
   4
   5 », __file__)
   6
   7 OVERVIEW(«
   8
   9 The idea of Logical Volume Management is to decouple data and
  10 storage. This offers great flexibility in managing storage and reduces
  11 server downtimes because the storage may be replaced while file
  12 systems are mounted read-write and applications are actively using
  13 them. This chapter provides an introduction to the Linux block layer
  14 and LVM. Subsequent sections cover selected device mapper targets.
  15
  16 »)
  17
  18 SECTION(«The Linux Block Layer»)
  19
  20 <p> The main task of LVM is the management of block devices, so it is
  21 natural to start an introduction to LVM with a section on the Linux
  22 block layer, which is the central component in the Linux kernel
  23 for the handling of persistent storage devices. The mission of the
  24 block layer is to provide a uniform interface to different types
  25 of storage devices. The obvious in-kernel users of this interface
  26 are the file systems and the swap subsystem. But also <em> stacking
  27 device drivers </em> like LVM, Bcache and MD access block devices
  28 through this interface to create virtual block devices from other block
  29 devices. Some user space programs (<code>fdisk, dd, mkfs, ...</code>)
  30 also need to access block devices. The block layer allows them to
  31 perform their task in a well-defined and uniform manner through
  32 block-special device files. </p>
  33
  34 <p> The userspace programs and the in-kernel users interact with the block
  35 layer by sending read or write requests. A <em>bio</em> is the central
  36 data structure that carries such requests within the kernel. Bios
  37 may contain an arbitrary amount of data. They are given to the block
  38 layer to be queued for subsequent handling. Often a bio has to travel
  39 through a stack of block device drivers where each driver modifies
  40 the bio and sends it on to the next driver. Typically, only the last
  41 driver in the stack corresponds to a hardware device. </p>
  42
  43 <p> Besides requests to read or write data blocks, there are various other
  44 bio requests that carry SCSI commands like FLUSH, FUA (Force Unit
  45 Access), TRIM and UNMAP. FLUSH and FUA ensure that certain data hits
  46 stable storage. FLUSH asks the the device to write out the contents of
  47 its volatile write cache while a FUA request carries data that should
  48 be written directly to the device, bypassing all caches. UNMAP/TRIM is
  49 a SCSI/ATA command which is only relevant to SSDs. It is a promise of
  50 the OS to not read the given range of blocks any more, so the device
  51 is free to discard the contents and return arbitrary data on the
  52 next read. This helps the device to level out the number of times
  53 the flash storage cells are overwritten (<em>wear-leveling</em>),
  54 which improves the durability of the device. </p>
  55
  56 <p> The first task of the block layer is to split incoming bios if
  57 necessary to make them conform to the size limit or the alignment
  58 requirements of the target device, and to batch and merge bios so that
  59 they can be submitted as a unit for performance reasons. The thusly
  60 processed bios then form an I/O request which is handed to an <em>
  61 I/O scheduler </em> (also known as <em> elevator</em>). </p>
  62
  63 <p> At this time of writing (2018-11) there exist two different sets
  64 of schedulers: the traditional single-queue schedulers and the
  65 modern multi-queue schedulers, which are expected to replace the
  66 single-queue schedulers soon. The three single-queue schedulers,
  67 noop, deadline and cfq (complete fair queueing), were designed for
  68 rotating disks. They reorder requests with the aim to minimize seek
  69 time. The newer multi-queue schedulers, mq-deadline, kyber, and bfq
  70 (budget fair queueing), aim to max out even the fastest devices. As
  71 implied by the name "multi-queue", they implement several request
  72 queues, the number of which depends on the hardware in use. This
  73 has become necessary because modern storage hardware allows multiple
  74 requests to be submitted in parallel from different CPUs. Moreover,
  75 with many CPUs the locking overhead required to put a request into
  76 a queue increases. Per-CPU queues allow for per-CPU locks, which
  77 decreases queue lock contention. </p>
  78
  79 <p> We will take a look at some aspects of the Linux block layer and on
  80 the various I/O schedulers. An exercise on loop devices enables the
  81 reader to create block devices for testing. This will be handy in
  82 the subsequent sections on LVM specific topics. </p>
  83
  84 EXERCISES()
  85
  86 <ul>
  87
  88         <li> Run <code>find /dev -type b</code> to get the list of all block
  89         devices on your system. Explain which is which. </li>
  90
  91         <li> Examine the files in <code>/sys/block/sda</code>, in
  92         particular <code>/sys/block/sda/stat</code>. Search the web for
  93         <code>Documentation/block/stat.txt</code> for the meaning of the
  94         numbers shown. Then run <code>iostat -xdh sda 1</code>. </li>
  95
  96         <li> Examine the files in <code>/sys/block/sda/queue</code>. </li>
  97
  98         <li> Find out how to determine the size of a block device. </li>
  99
 100         <li> Figure out a way to identify the name of all block devices which
 101         correspond to SSDs (i.e., excluding any rotating disks). </li>
 102
 103         <li> Run <code>lsblk</code> and discuss
 104         the output. Too easy? Run <code>lsblk -o
 105         KNAME,PHY-SEC,MIN-IO,OPT-IO,PHY-SEC,LOG-SEC,RQ-SIZE,ROTA,SCHED</code>
 106         </li>
 107
 108         <li> What's the difference between a task scheduler and an I/O
 109         scheduler? </li>
 110
 111         <li> Why are I/O schedulers also called elevators? </li>
 112
 113         <li> How can one find out which I/O schedulers are supported on a
 114         system and which scheduler is active for a given block device? </li>
 115
 116         <li> Is it possible (and safe) to change the I/O scheduler for a
 117         block device while it is in use? If so, how can this be done? </li>
 118
 119         <li> The loop device driver of the Linux kernel allows privileged
 120         users to create a block device from a regular file stored on a file
 121         system.  The resulting block device is called a <em>loop</em> device.
 122         Create a 1G large temporary file containing only zeroes. Run a suitable
 123         <code>losetup(8)</code> command to create a loop device from the
 124         file. Create an XFS file system on the loop device and mount it. </li>
 125
 126 </ul>
 127
 128 HOMEWORK(«
 129
 130 <ul>
 131         <li> Come up with three different use cases for loop devices. </li>
 132
 133         <li> Given a block device node in <code> /dev</code>, how can one
 134         tell that it is a loop device? </li>
 135
 136         <li> Describe the connection between loop devices created by
 137         <code>losetup(8)</code> and the loopback device used for network
 138         connections from the machine to itself. </li>
 139
 140 </ul>
 141 »)
 142
 143 define(«svg_disk», «
 144         <g
 145                 fill="$5"
 146                 stroke="black"
 147                 stroke-width="1"
 148         >
 149         <ellipse
 150                 cx="eval($1 + $3 / 2)"
 151                 cy="eval($2 + $4)"
 152                 rx="eval($3 / 2)"
 153                 ry="eval($3 / 4)"
 154         />
 155         <rect
 156                 x="$1"
 157                 y="$2"
 158                 width="$3"
 159                 height="$4"
 160         />
 161         <rect
 162                 x="eval($1 + 1)"
 163                 y="eval($2 + $4 - 1)"
 164                 width="eval($3 - 2)"
 165                 height="2"
 166                 stroke="$5"
 167         />
 168         <ellipse
 169                 cx="eval($1 + $3 / 2)"
 170                 cy="$2"
 171                 rx="eval($3 / 2)"
 172                 ry="eval($3 / 4)"
 173         />
 174         </g>
 175 »)
 176
 177 SECTION(«Physical and Logical Volumes, Volume Groups»)
 178
 179 <p> Getting started with the Logical Volume Manager (LVM) requires to
 180 get used to a minimal set of vocabulary. This section introduces
 181 the words named in the title of the section, and a couple more.
 182 The basic concepts of LVM are then described in terms of these words. </p>
 183
 184 <div>
 185 define(lvm_width», «300»)
 186 define(«lvm_height», «183»)
 187 define(«lvm_margin», «10»)
 188 define(«lvm_extent_size», «10»)
 189 define(«lvm_extent», «
 190         <rect
 191                 fill="$1"
 192                 x="$2"
 193                 y="$3"
 194                 width="lvm_extent_size()"
 195                 height="lvm_extent_size()"
 196                 stroke="black"
 197                 stroke-width="1"
 198         />
 199 »)
 200 dnl $1: color, $2: x, $3: y, $4: number of extents
 201 define(«lvm_extents», «
 202         ifelse(«$4», «0», «», «
 203                 lvm_extent(«$1», «$2», «$3»)
 204                 lvm_extents(«$1», eval($2 + lvm_extent_size() + lvm_margin()),
 205                         «$3», eval($4 - 1))
 206         »)
 207 »)
 208 dnl $1: x, $2: y, $3: number of extents, $4: disk color, $5: extent color
 209 define(«lvm_disk», «
 210         ifelse(eval(«$3» > 3), «1», «
 211                 pushdef(«h», «eval(7 * lvm_extent_size())»)
 212                 pushdef(«w», «eval(($3 + 1) * lvm_extent_size())»)
 213         », «
 214                 pushdef(«h», «eval(3 * lvm_extent_size() + lvm_margin())»)
 215                 pushdef(«w», «eval($3 * lvm_extent_size() * 2)»)
 216         »)
 217         svg_disk(«$1», «$2», «w()», «h()», «$4»)
 218         ifelse(eval(«$3» > 3), «1», «
 219                 pushdef(«n1», eval(«$3» / 2))
 220                 pushdef(«n2», eval(«$3» - n1()))
 221                 lvm_extents(«$5»,
 222                         eval(«$1» + (w() - (2 * n1() - 1) * lvm_extent_size()) / 2),
 223                         eval(«$2» + h() / 2 - lvm_extent_size()), «n1()»)
 224                 lvm_extents(«$5»,
 225                         eval(«$1» + (w() - (2 * n2() - 1) * lvm_extent_size()) / 2),
 226                         eval(«$2» + h() / 2 + 2 * lvm_extent_size()), «n2()»)
 227                 popdef(«n1»)
 228                 popdef(«n2»)
 229         », «
 230                 lvm_extents(«$5»,
 231                         eval(«$1» + (w() - (2 * «$3» - 1) * lvm_extent_size()) / 2),
 232                         eval(«$2» + h() / 2), «$3»)
 233         »)
 234         popdef(«w»)
 235         popdef(«h»)
 236 »)
 237 <svg
 238         width="lvm_width()" height="lvm_height()"
 239         xmlns="http://www.w3.org/2000/svg"
 240         xmlns:xlink="http://www.w3.org/1999/xlink"
 241 >
 242         <rect
 243                 x=1
 244                 y=1
 245                 width="140"
 246                 height="180"
 247                 fill="green"
 248                 rx="10"
 249                 stroke-width="1"
 250                 stroke="black"
 251         />
 252         lvm_disk(«20», «20», «2», «#666», «yellow»)
 253         lvm_disk(«10», «90», «4», «#666», «yellow»)
 254         lvm_disk(«70», «55», «5», «#666», «yellow»)
 255         <path
 256                 d="
 257                         M 155 91
 258                         l 20 0
 259                         m 0 0
 260                         l -4 -3
 261                         l 0 6
 262                         l 4 -3
 263                         z
 264                 "
 265                 stroke-width="4"
 266                 stroke="black"
 267                 fill="black"
 268         />
 269         lvm_disk(«190», «22», «7», «#66f», «orange»)
 270         lvm_disk(«220», «130», «1», «#66f», «orange»)
 271 </svg>
 272 </div>
 273
 274 <p> A <em> Physical Volume</em> (PV, grey) is an arbitrary block device which
 275 contains a certain metadata header (also known as <em>superblock</em>)
 276 at the start. PVs can be partitions on a local hard disk or a SSD,
 277 a soft- or hardware raid, or a loop device. LVM does not care.
 278 The storage space on a physical volume is managed in units called <em>
 279 Physical Extents </em> (PEs, yellow). The default PE size is 4M. </p>
 280
 281 <p> A <em>Volume Group</em> (VG, green) is a non-empty set of PVs with
 282 a name and a unique ID assigned to it. A PV can but doesn't need to
 283 be assigned to a VG. If it is, the ID of the associated VG is stored
 284 in the metadata header of the PV. </p>
 285
 286 <p> A <em> Logical Volume</em> (LV, blue) is a named block device which is
 287 provided by LVM. LVs are always associated with a VG and are stored
 288 on that VG's PVs. Since LVs are normal block devices, file systems
 289 of any type can be created on them, they can be used as swap storage,
 290 etc. The chunks of a LV are managed as <em>Logical Extents</em> (LEs,
 291 orange). Often the LE size equals the PE size.  For each LV there is
 292 a mapping between the LEs of the LV and the PEs of the underlying
 293 PVs. The PEs can spread multiple PVs. </p>
 294
 295 <p> VGs can be extended by adding additional PVs to it, or reduced by
 296 removing unused devices, i.e., those with no PEs allocated on them. PEs
 297 may be moved from one PV to another while the LVs are active. LVs
 298 may be grown or shrunk. To grow a LV, there must be enough space
 299 left in the VG. Growing a LV does not magically grow the file system
 300 stored on it, however. To make use of the additional space, a second,
 301 file system specific step is needed to tell the file system that it's
 302 underlying block device (the LV) has grown. </p>
 303
 304 <p> The exercises of this section illustrate the basic LVM concepts
 305 and the essential LVM commands. They ask the reader to create a VG
 306 whose PVs are loop devices. This VG is used as a starting point in
 307 subsequent chapters. </p>
 308
 309 EXERCISES()
 310
 311 <ul>
 312
 313         <li> Create two 5G large loop devices <code>/dev/loop1</code>
 314         and <code>/dev/loop2</code>. Make them PVs by running
 315         <code>pvcreate</code>. Create a VG <code>tvg</code> (test volume group)
 316         from the two loop devices and two 3G large LVs named <code>tlv1</code>
 317         and <code>tlv2</code> on it. Run the <code>pvcreate, vgcreate</code>,
 318         and <code>lvcreate</code> commands with <code>-v</code> to activate
 319         verbose output and try to understand each output line. </li>
 320
 321         <li> Run <code>pvs, vgs, lvs, lvdisplay, pvdisplay</code> and examine
 322         the output. </li>
 323
 324         <li> Run <code>lvdisplay -m</code> to examine the mapping of logical
 325         extents to PVs and physical extents. </li>
 326
 327         <li> Run <code>pvs --segments -o+lv_name,seg_start_pe,segtype</code>
 328         to see the map between physical extents and logical extents. </li>
 329
 330 </ul>
 331
 332 HOMEWORK(«
 333
 334 In the above scenario (two LVs in a VG consisting of two PVs), how
 335 can you tell whether both PVs are actually used? Remove the LVs
 336 with <code>lvremove</code>. Recreate them, but this time use the
 337 <code>--stripes 2</code> option to <code>lvcreate</code>. Explain
 338 what this option does and confirm with a suitable command.
 339
 340 »)
 341
 342 SECTION(«Device Mapper and Device Mapper Targets»)
 343
 344 <p> The kernel part of the Logical Volume Manager (LVM) is called
 345 <em>device mapper</em> (DM), which is a generic framework to map
 346 one block device to another. Applications talk to the Device Mapper
 347 via the <em>libdevmapper</em> library, which issues requests
 348 to the <code>/dev/mapper/control</code> character device using the
 349 <code>ioctl(2)</code> system call. The device mapper is also accessible
 350 from scripts via the <code>dmsetup(8)</code> tool. </p>
 351
 352 <p> A DM target represents one particular mapping type for ranges
 353 of LEs. Several DM targets exist, each of which which creates and
 354 maintains block devices with certain characteristics. In this section
 355 we take a look at the <code>dmsetup</code> tool and the relatively
 356 simple <em>mirror</em> target. Subsequent sections cover other targets
 357 in more detail. </p>
 358
 359 EXERCISES()
 360
 361 <ul>
 362
 363         <li> Run <code>dmsetup targets</code> to list all targets supported
 364         by the currently running kernel. Explain their purpose and typical
 365         use cases. </li>
 366
 367         <li> Starting with the <code>tvg</code> VG, remove <code>tlv2</code>.
 368         Convince yourself by running <code>vgs</code> that <code>tvg</code>
 369         is 10G large, with 3G being in use. Run <code>pvmove
 370         /dev/loop1</code> to move the used PEs of <code>/dev/loop1</code>
 371         to <code>/dev/loop2</code>. After the command completes, run
 372         <code>pvs</code> again to see that <code>/dev/loop1</code> has no
 373         more PEs in use. </li>
 374
 375         <li> Create a third 5G loop device <code>/dev/loop3</code>, make it a
 376         PV and extend the VG with <code>vgextend tvg /dev/loop3</code>. Remove
 377         <code>tlv1</code>. Now the LEs of <code>tlv2</code> fit on any
 378         of the three PVs.  Come up with a command which moves them to
 379         <code>/dev/loop3</code>. </li>
 380
 381         <li> The first two loop devices are both unused. Remove them from
 382         the VG with <code>vgreduce -a</code>. Why are they still listed in
 383         the <code>pvs</code> output? What can be done about that? </li>
 384
 385 </ul>
 386
 387 HOMEWORK(«
 388
 389 As advertised in the introduction, LVM allows the administrator to
 390 replace the underlying storage of a file system online. This is done
 391 by running a suitable <code>pvmove(8)</code> command to move all PEs of
 392 one PV to different PVs in the same VG.
 393
 394 <ul>
 395
 396         <li> Explain the mapping type of dm-mirror. </li>
 397
 398         <li> The traditional way to mirror the contents of two or more block
 399         devices is software raid 1, also known as <em>md raid1</em> ("md"
 400         is short for multi-disk). Explain the difference between md raid1,
 401         the dm-raid target which supports raid1 and other raid levels, and
 402         the dm-mirror target. </li>
 403
 404         <li> Guess how <code>pvmove</code> is implemented on top of
 405         dm-mirror. Verify your guess by reading the "NOTES" section of the
 406         <code>pvmove(8)</code> man page. </li>
 407
 408 </ul>
 409 »)
 410
 411 SECTION(«LVM Snapshots»)
 412
 413 <p> LVM snapshots are based on the CoW optimization strategy described
 414 earlier in the chapter on <a href="./Unix_Concepts.html#processes">Unix
 415 Concepts</a>. Creating a snapshot means to create a CoW table of the
 416 given size. Just before a LE of a snapshotted LV is about to be written
 417 to, its contents are copied to a free slot in the CoW table. This
 418 preserves an old version of the LV, the snapshot, which can later be
 419 reconstructed by overlaying the CoW table atop the LV. </p>
 420
 421 <p> Snapshots can be taken from a LV which contains a mounted file system,
 422 while applications are actively modifying files. Without coordination
 423 between the file system and LVM, the file system most likely has memory
 424 buffers scheduled for writeback. These outstanding writes did not make
 425 it to the snapshot, so one can not expect the snapshot to contain a
 426 consistent file system image. Instead, it is in a similar state as a
 427 regular device after an unclean shutdown. This is not a problem for
 428 XFS and EXT4, as both are <em>journalling</em> file systems, which
 429 were designed with crash recovery in mind. At the next mount after a
 430 crash, journalling file systems replay their journal, which results
 431 in a consistent state. Note that this implies that even a read-only
 432 mount of the snapshot device has to write to the device. </p>
 433
 434 EXERCISES()
 435
 436 <ul>
 437
 438         <li> In the test VG, create a 1G large snapshot named
 439         <code>snap_tlv1</code> of the <code>tlv1</code> VG by using the
 440         <code>-s</code> option to <code>lvcreate(8)</code>. Predict how much
 441         free space is left in the VG. Confirm with <code>vgs tvg</code>. </li>
 442
 443         <li> Create an EXT4 file system on <code>tlv1</code> by running
 444         <code>mkfs.ext4 /dev/tvg/lv1</code>. Guess how much of the snapshot
 445         space has been allocated by this operation. Check with <code>lvs
 446         tvg1/snap_lv1</code>. </li>
 447
 448         <li> Remove the snapshot with <code>lvremove</code> and recreate
 449         it. Repeat the previous step, but this time run <code>mkfs.xfs</code>
 450         to create an XFS file system. Run <code>lvs tvg/snap_lv1</code>
 451         again and compare the used snapshot space to the EXT4 case. Explain
 452         the difference. </li>
 453
 454         <li> Remove the snapshot and recreate it so that both <code>tlv1</code>
 455         and <code>snap_tlv1</code> contain a valid XFS file system. Mount
 456         the file systems on <code>/mnt/1</code> and <code>/mnt/2</code>. </li>
 457
 458         <li> Run <code>dd if=/dev/zero of=/mnt/1/zero count=$((2 * 100 *
 459         1024))</code> to create a 100M large file on <code>tlv1</code>. Check
 460         that <code>/mnt/2</code> is still empty. Estimate how much of the
 461         snapshot space is used and check again. </li>
 462
 463         <li> Repeat the above <code>dd</code> command 5 times and run
 464         <code>lvs</code> again. Explain why the used snapshot space did not
 465         increase. </li>
 466
 467         <li> It is possible to create snapshots of snapshots. This is
 468         implemented by chaining together CoW tables. Describe the impact on
 469         performance. </li>
 470
 471         <li> Suppose a snapshot was created before significant modifications
 472         were made to the contents of the LV, for example an upgrade of a large
 473         software package. Assume that the user wishes to permanently return to
 474         the old version because the upgrade did not work out. In this scenario
 475         it is the snapshot which needs to be retained, rather than the original
 476         LV.  In view of this scenario, guess what happens on the attempt to
 477         remove a LV which is being snapshotted. Unmount <code>/mnt/1</code>
 478         and confirm by running <code>lvremove tvg/lv1</code>. </li>
 479
 480         <li> Come up with a suitable <code>lvconvert</code> command which
 481         replaces the role of the LV and its snapshot. Explain why this solves
 482         the "bad upgrade" problem outlined above. </li>
 483
 484         <li> Explain what happens if the CoW table fills up. Confirm by
 485         writing a file larger than the snapshot size. </li>
 486
 487 </ul>
 488
 489 SECTION(«Thin Provisioning»)
 490
 491 <p> The term "thin provisioning" is just a modern buzzword for
 492 over-subscription. Both terms mean to give the appearance of having
 493 more resources than are actually available. This is achieved by
 494 on-demand allocation. The thin provisioning implementation of Linux
 495 is implemented as a DM target called <em>dm-thin</em>. This code
 496 first made its appearance in 2011 and was declared as stable two
 497 years later. These days it should be safe for production use. </p>
 498
 499 <p> The general problem with thin provisioning is of course that bad
 500 things happen when the resources are exhausted because the demand has
 501 increased before new resources were added. For dm-thin this can happen
 502 when users write to their allotted space, causing dm-thin to attempt
 503 allocating a data block from a volume which is already full. This
 504 usually leads to severe data corruption because file systems are
 505 not really prepared to handle this error case and treat it as if the
 506 underlying block device had failed. dm-thin does nothing to prevent
 507 this, but one can configure a <em>low watermark</em>.  When the
 508 number of free data blocks drops below the watermark, a so-called
 509 <em>dm-event</em> will be generated to notice the administrator. </p>
 510
 511 <p> One highlight of dm-thin is its efficient support for an arbitrary
 512 depth of recursive snapshots, called <em>dm-thin snapshots</em>
 513 in this document. With the traditional snapshot implementation,
 514 recursive snapshots quickly become a performance issue as the depth
 515 increases. With dm-thin one can have an arbitrary subset of all
 516 snapshots active at any point in time, and there is no ordering
 517 requirement on activating or removing them. </p>
 518
 519 <p> The block devices created by dm-thin always belong to a <em>thin
 520 pool</em> which ties together two LVs called the <em>metadata LV</em>
 521 and the <em>data LV</em>. The combined LV is called the <em>thin pool
 522 LV</em>. Setting up a VG for thin provisioning is done in two steps:
 523 First the standard LVs for data and the metatdata are created. Second,
 524 the two LVs are combined into a thin pool LV. The second step hides
 525 the two underlying LVs so that only the combined thin pool LV is
 526 visible afterwards. Thin provisioned LVs and dm-thin snapshots can
 527 then be created from the thin pool LV with a single command. </p>
 528
 529 <p> Another nice feature of dm-thin are <em>external snapshots</em>.
 530 An external snapshot is one where the origin for a thinly provisioned
 531 device is not a device of the pool. Arbitrary read-only block
 532 devices can be turned into writable devices by creating an external
 533 snapshot. Reads to an unprovisioned area of the snapshot will be passed
 534 through to the origin. Writes trigger the allocation of new blocks as
 535 usual with CoW. One use case for this is VM hosts which run their VMs
 536 on thinly-provisioned volumes but have the base image on some "master"
 537 device which is read-only and can hence be shared between all VMs. </p>
 538
 539 EXERCISES()
 540
 541 <p> Starting with the <code>tvg</code> VG, create and test a thin pool LV
 542 by performing the following steps.  The "Thin Usage" section of
 543 <code>lvmthin(7)</code> will be helpful.
 544
 545 <ul>
 546
 547         <li> Remove the <code>tlv1</code> and <code>tlv2</code> LVs. </li>
 548
 549         <li> Create a 5G data LV named <code>tdlv</code> (thin data LV)
 550         and a 500M LV named <code>tmdlv</code> (thin metada LV). </li>
 551
 552         <li> Combine the two LVs into a thin pool with
 553         <code>lvconvert</code>. Run <code>lvs -a</code> and explain the flags
 554         listed below <code>Attr</code>. </li>
 555
 556         <li> Create a 10G thin LV named <code>oslv</code> (over-subscribed
 557         LV). </li>
 558
 559         <li> Create an XFS file system on <code>oslv</code> and mount it on
 560         <code>/mnt</code>. </li>
 561
 562         <li> Run a loop of the form <code>for ((i = 0; i &lt; 50; i++)): do
 563         ... ; done</code> so that each iteration creates a 50M file named
 564         <code>file-$i</code> and a snapshot named <code>snap_oslv-$i</code>
 565         of <code>oslv</code>. </li>
 566
 567         <li> Activate an arbitrary snapshot with <code>lvchange -K</code> and
 568         try to mount it. Explain what the error message means. Then read the
 569         "XFS on snapshots" section of <code>lvmthin(7)</code>. </li>
 570
 571         <li> Check the available space of the data LV with <code>lvs
 572         -a</code>. Mount one snapshot (specifying <code>-o nouuid</code>)
 573         and run <code>lvs -a</code> again.  Why did the free space decrease
 574         although no new files were written? </li>
 575
 576         <li> Mount four different snapshots and check that they contain the
 577         expected files. </li>
 578
 579         <li> Remove all snapshots. Guess what <code>lvs -a</code> and <code>dh
 580         -h /mnt</code> report. Then run the commands to confirm. Guess
 581         what happens if you try to create another 3G file? Confirm
 582         your guess, then read the section on "Data space exhaustion" of
 583         <code>lvmthin(7)</code>. </li>
 584
 585 </ul>
 586
 587 HOMEWORK(«
 588
 589 When a thin pool provisions a new data block for a thin LV, the new
 590 block is first overwritten with zeros by default. Discuss why this
 591 is done, its impact on performance and security, and conclude whether
 592 or not it is a good idea to turn off the zeroing.
 593
 594 »)
 595
 596 SECTION(«Bcache, dm-cache and dm-writecache»)
 597
 598 <p> All three implementations named in the title of this chapter are <em>
 599 Linux block layer caches</em>. They combine two different block
 600 devices to form a hybrid block device which dynamically caches
 601 and migrates data between the two devices with the aim to improve
 602 performance. One device, the <em> backing device</em>, is expected
 603 to be large and slow while the other one, the <em>cache device</em>,
 604 is expected to be small and fast. </p>
 605
 606 <div>
 607 define(«bch_width», «300»)
 608 define(«bch_height», «130»)
 609 define(«bch_margin», «10»)
 610 define(«bch_rraid_width», «eval((bch_width() - 4 * bch_margin()) * 4 / 5)»)
 611 define(«bch_raidbox_height», «eval(bch_height() - 2 * bch_margin())»)
 612 define(«bch_nraid_width», «eval(bch_rraid_width() / 4)»)
 613 define(«bch_rdisk_width», «eval((bch_width() - 3 * bch_margin()) * 18 / 100)»)
 614 define(«bch_rdisk_height», «eval((bch_height() - 4 * bch_margin()) / 3)»)
 615 define(«bch_ndisk_width», «eval(bch_rdisk_width() / 2)»)
 616 define(«bch_ndisk_height», «eval(bch_raidbox_height() - 5 * bch_margin())»)
 617 define(«bch_rdisk», «svg_disk(«$1», «$2»,
 618         «bch_rdisk_width()», «bch_rdisk_height()», «#666»)»)
 619 define(«bch_ndisk», «svg_disk(«$1», «$2»,
 620         «bch_ndisk_width()», «bch_ndisk_height()», «#66f»)»)
 621 define(«bch_5rdisk», «
 622         bch_rdisk(«$1», «$2»)
 623         bch_rdisk(«eval($1 + bch_margin())»,
 624                 «eval($2 + bch_margin())»)
 625         bch_rdisk(«eval($1 + 2 * bch_margin())»,
 626                 «eval($2 + 2 * bch_margin())»)
 627         bch_rdisk(«eval($1 + 3 * bch_margin())»,
 628                 «eval($2 + 3 * bch_margin())»)
 629         bch_rdisk(«eval($1 + 4 * bch_margin())»,
 630                 «eval($2 + 4 * bch_margin())»)
 631
 632 »)
 633 define(«bch_rraid», «
 634         <rect
 635                 fill="#3b3"
 636                 stroke="black"
 637                 x="$1"
 638                 y="$2"
 639                 width="bch_rraid_width()"
 640                 height="bch_raidbox_height()"
 641                 rx=10
 642         />
 643         bch_5rdisk(«eval($1 + bch_margin())»,
 644                 «eval($2 + 2 * bch_margin())»)
 645         bch_5rdisk(«eval($1 + 2 * bch_rdisk_width() + bch_margin())»,
 646                 «eval($2 + 2 * bch_margin())»)
 647 »)
 648 define(«bch_nraid», «
 649         <rect
 650                 fill="orange"
 651                 stroke="black"
 652                 x="$1"
 653                 y="$2"
 654                 width="bch_nraid_width()"
 655                 height="bch_raidbox_height()"
 656                 rx=10
 657         />
 658         bch_ndisk(eval($1 + bch_margin()),
 659                 eval($2 + 2 * bch_margin()))
 660         bch_ndisk(eval($1 + 2 * bch_margin()),
 661                 eval($2 + 3 * bch_margin()))
 662 »)
 663
 664 <svg
 665         width="bch_width()" height="bch_height()"
 666         xmlns="http://www.w3.org/2000/svg"
 667         xmlns:xlink="http://www.w3.org/1999/xlink"
 668 >
 669         <rect
 670                 fill="#cc2"
 671                 stroke="black"
 672                 stroke-width="1"
 673                 x="1"
 674                 y="1"
 675                 width="eval(bch_rraid_width() + bch_nraid_width()
 676                         + 3 * bch_margin() - 2)"
 677                 height="eval(bch_raidbox_height() + 2 * bch_margin() - 2)"
 678                 rx="10"
 679         />
 680         bch_nraid(«bch_margin()», «bch_margin()»)
 681         bch_rraid(«eval(2 * bch_margin() + bch_nraid_width())», «bch_margin()»)
 682 </svg>
 683 </div>
 684
 685 <p> The most simple setup consists of a single rotating disk and one SSD.
 686 The setup shown in the diagram at the left is realistic for a large
 687 server with redundant storage.  In this setup the hybrid device
 688 (yellow) combines a raid6 array (green) consisting of many rotating
 689 disks (grey) with a two-disk raid1 array (orange) stored on fast
 690 NVMe devices (blue). In the simple setup it is always a win when
 691 I/O is performed from/to the SSD instead of the rotating disk. In
 692 the server setup, however, it depends on the workload which device
 693 is faster. Given enough rotating disks and a streaming I/O workload,
 694 the raid6 outperforms the raid1 because all disks can read or write
 695 at full speed. </p>
 696
 697 <p> Since block layer caches hook into the Linux block API described <a
 698 href="«#»the_linux_block_layer">earlier</a>, the hybrid block devices
 699 they provide can be used like any other block device. In particular,
 700 the hybrid devices are <em> file system agnostic</em>, meaning that
 701 any file system can be created on them. In what follows we briefly
 702 describe the differences between the three block layer caches and
 703 conclude with the pros and cons of each. </p>
 704
 705 <p> Bcache is a stand-alone stacking device driver which was
 706 included in the Linux kernel in 2013. According to the <a
 707 href="https://bcache.evilpiepirate.org/">bcache home page</a>, it
 708 is "done and stable". dm-cache and dm-writecache are device mapper
 709 targets included in 2013 and 2018, respectively, which are both marked
 710 as experimental. In contrast to dm-cache, dm-writecache only caches
 711 writes while reads are supposed to be cached in RAM. It has been
 712 designed for programs like databases which need low commit latency.
 713 Both bcache and dm-cache can operate in writeback or writethrough
 714 mode while dm-writecache always operates in writeback mode. </p>
 715
 716 <p> The DM-based caches are designed to leave the decision as to what
 717 data to migrate (and when) to user space while bcache has this policy
 718 built-in. However, at this point only the <em> Stochastic Multiqueue
 719 </em> (smq) policy for dm-cache exists, plus a second policy which
 720 is only useful for decommissioning the cache device. There are no
 721 tunables for dm-cache while all the bells and whistles of bcache can
 722 be configured through sysfs files.  Another difference is that bcache
 723 detects sequential I/O and separates it from random I/O so that large
 724 streaming reads and writes bypass the cache and don't push cached
 725 randomly accessed data out of the cache. </p>
 726
 727 <p> bcache is the clear  winner of this comparison because it is stable,
 728 configurable and performs better at least on the server setup
 729 described above because it separate random and sequential I/O. The
 730 only advantage of dm-cache is its flexibility because cache policies
 731 can be switched. But even this remains a theoretical advantage as
 732 long as only a single policy for dm-cache exists. </p>
 733
 734 EXERCISES()
 735
 736 <ul>
 737
 738         <li> Recall the concepts of writeback and writethrough and explain
 739         why writeback is faster and writethrough is safer. </li>
 740
 741         <li> Explain how the <em>writearound</em> mode of bcache works and
 742         when it should be used. </li>
 743
 744         <li> Setup a bcache device from two loop devices. </li>
 745
 746         <li> Create a file system of a bcache device and mount it. Detach
 747         the cache device while the file system is mounted. </li>
 748
 749         <li> Setup a dm-cache device from two loop devices. </li>
 750
 751         <li> Setup a thin pool where the data LV is a dm-cache device.</li>
 752
 753         <li> Explain the point of dm-cache's <em>passthrough</em> mode.</li>
 754
 755 </ul>
 756
 757 HOMEWORK(«
 758
 759 Explain why small writes to a file system which is stored on a
 760 parity raid result in read-modify-write (RMW) updates. Explain why
 761 RMW updates are particularly expensive and how raid implementations
 762 and block layer caches try to avoid them.
 763
 764 »)
 765
 766 HOMEWORK(«
 767
 768 Recall the concepts of writeback and writethrough. Describe what
 769 each mode means for a hardware device and for a bcache/dm-cache
 770 device. Explain why writeback is faster and writethrough is safer.
 771
 772 »)
 773
 774 HOMEWORK(«
 775
 776 TRIM and UNMAP are special commands in the ATA/SCSI command sets
 777 which inform an SSD that certain data blocks are no longer in use,
 778 allowing the SSD to re-use these blocks to increase performance and
 779 to reduce wear. Subsequent reads from the trimmed data blocks will
 780 not return any meaningful data. For example, the <code> mkfs </code>
 781 commands sends this command to discard all blocks of the device.
 782 Discuss the implications when <code> mkfs. </code> is run on a device
 783 provided by bcache or dm-cache.
 784
 785 »)
 786
 787 SECTION(«The dm-crypt Target»)
 788
 789 <p> This device mapper target provides encryption of arbitrary block
 790 devices by employing the primitives of the crypto API of the Linux
 791 kernel. This API provides a uniform interface to a large number of
 792 cipher algorithms which have been implemented with performance and
 793 security in mind. </p>
 794
 795 <p> The cipher algorithm of choice for the encryption of block devices
 796 is the <em> Advanced Encryption Standard </em> (AES), also known
 797 as <em> Rijndael</em>, named after the two Belgian cryptographers
 798 Rijmen and Daemen who proposed the algorithm in 1999. AES is a <em>
 799 symmetric block cipher</em>. That is, a transformation which operates
 800 on fixed-length blocks and which is determined by a single key for both
 801 encryption and decryption. The underlying algorithm is fairly simple,
 802 which makes AES perform well in both hardware and software. Also
 803 the key setup time and the memory requirements are excellent. Modern
 804 processors of all manufacturers include instructions to perform AES
 805 operations in hardware, improving speed and security. </p>
 806
 807 <p> According to the Snowden documents, the NSA has been doing research
 808 on breaking AES for a long time without being able to come up with
 809 a practical attack for 256 bit keys. Successful attacks invariably
 810 target the key management software instead, which is often implemented
 811 poorly, trading security for user-friendliness, for example by
 812 storing passwords weakly encrypted, or by providing a "feature"
 813 which can decrypt the device without knowing the password. </p>
 814
 815 <p> The exercises of this section ask the reader to encrypt a loop device
 816 with AES without relying on any third party key management software </p>.
 817
 818 EXERCISES()
 819 <ul>
 820         <li> Discuss the message of this <a
 821         href="https://xkcd.com/538/">xkcd</a> comic. </li>
 822
 823         <li> How can a hardware implementation of an algorithm like AES
 824         improve security? After all, it is the same algorithm that is
 825         implemented. </li>
 826
 827         <li> What's the point of the <a href="#random_stream">rstream.c</a>
 828         program below which writes random data to stdout? Doesn't <code>
 829         cat /dev/urandom </code> do the same? </li>
 830
 831         <li> Compile and run <a href="#random_stream">rstream.c</a> to create
 832         a 10G local file and create the loop device <code> /dev/loop0 </code>
 833         from the file. </li>
 834
 835         <li> A <em> table </em> for the <code> dmsetup(8) </code> command is
 836         a single line of the form <code> start_sector num_sectors target_type
 837         target_args</code>. Determine the correct values for the first three
 838         arguments to encrypt <code> /dev/loop0</code>. </li>
 839
 840         <li> The <code> target_args </code> for the dm-crypt target are
 841         of the form <code> cipher key iv_offset device offset</code>. To
 842         encrypt <code> /dev/loop0 </code> with AES-256, <code> cipher </code>
 843         is <code> aes</code>, device is <code> /dev/loop0 </code> and both
 844         offsets are zero. Come up with an idea to create a 256 bit key from
 845         a passphrase. </li>
 846
 847         <li> The <code> create </code> subcommand of <code> dmsetup(8)
 848         </code> creates a device from the given table. Run a command of
 849         the form <code> echo "$table" | dmsetup create cryptdev </code>
 850         to create the encrypted device <code> /dev/mapper/cryptdev </code>
 851         from the loop device. </li>
 852
 853         <li> Create a file system on <code> /dev/mapper/cryptdev</code>,
 854         mount it and create the file <code> passphrase </code> containing
 855         the string "super-secret" on this file system. </li>
 856
 857         <li> Unmount the <code> cryptdev </code> device and run <code> dmsetup
 858         remove cryptdev</code>. Run <code> strings </code> on the loop device
 859         and on the underlying file to see if it contains the string <code>
 860         super-secret" </code> or <code> passphrase</code>. </li>
 861
 862         <li> Re-create the <code> cryptdev </code> device, but this time use
 863         a different (hence invalid) key. Guess what happens and confirm. </li>
 864
 865         <li> Write a script which disables echoing (<code>stty -echo</code>),
 866         reads a passphrase from stdin and combines the above steps to create
 867         and mount an encrypted device. </li>
 868
 869 </ul>
 870
 871 HOMEWORK(«
 872
 873 Why is it a good idea to overwrite a block device with random data
 874 before it is encrypted?
 875
 876 »)
 877
 878 HOMEWORK(«
 879
 880 The dm-crypt target encrypts whole block devices. An alternative is
 881 to encrypt on the file system level. That is, each file is encrypted
 882 separately. Discuss the pros and cons of both approaches.
 883
 884 »)
 885
 886 SUPPLEMENTS()
 887
 888 SUBSECTION(«Random stream»)
 889
 890 <pre>
 891         <code>
 892                 /* Link with -lcrypto */
 893                 #include &lt;openssl/rand.h&gt;
 894                 #include &lt;stdio.h&gt;
 895                 #include &lt;unistd.h&gt;
 896                 #include &lt;stdio.h&gt;
 897
 898                 int main(int argc, char **argv)
 899                 {
 900                         unsigned char buf[1024 * 1024];
 901
 902                         for (;;) {
 903                                 int ret = RAND_bytes(buf, sizeof(buf));
 904
 905                                 if (ret &lt;= 0) {
 906                                         fprintf(stderr, "RAND_bytes() error\n");
 907                                         exit(EXIT_FAILURE);
 908                                 }
 909                                 ret = write(STDOUT_FILENO, buf, sizeof(buf));
 910                                 if (ret &lt; 0) {
 911                                         perror("write");
 912                                         exit(EXIT_FAILURE);
 913                                 }
 914                         }
 915                         return 0;
 916                 }
 917         </code>
 918 </pre>