Who the heck is General Failure, and why is he reading my disk? -- Unknown
», __file__)

The idea of Logical Volume Management is to decouple data and
storage. This offers great flexibility in managing storage and reduces
server downtimes because the storage may be replaced while file
systems are mounted read-write and applications are actively using
them. This chapter provides an introduction to the Linux block layer
and LVM. Subsequent sections cover selected device mapper targets.
»)
SECTION(«The Linux Block Layer»)

<p> The main task of LVM is the management of block devices, so it is
natural to start an introduction to LVM with a section on the Linux
block layer, which is the central component in the Linux kernel
for the handling of persistent storage devices. The mission of the
block layer is to provide a uniform interface to different types
of storage devices. The obvious in-kernel users of this interface
are the file systems and the swap subsystem. <em> Stacking
device drivers </em> like LVM, Bcache and MD also access block devices
through this interface to create virtual block devices from other block
devices. Some user space programs (<code>fdisk, dd, mkfs, ...</code>)
also need to access block devices. The block layer allows them to
perform their task in a well-defined and uniform manner through
block-special device files. </p>
<p> The userspace programs and the in-kernel users interact with the block
layer by sending read or write requests. A <em>bio</em> is the central
data structure that carries such requests within the kernel. Bios
may contain an arbitrary amount of data. They are given to the block
layer to be queued for subsequent handling. Often a bio has to travel
through a stack of block device drivers where each driver modifies
the bio and sends it on to the next driver. Typically, only the last
driver in the stack corresponds to a hardware device. </p>
<p> Besides requests to read or write data blocks, there are various other
bio requests that carry SCSI or ATA commands like FLUSH, FUA (Force Unit
Access), TRIM and UNMAP. FLUSH and FUA ensure that certain data hits
stable storage. FLUSH asks the device to write out the contents of
its volatile write cache while a FUA request carries data that should
be written directly to the device, bypassing all caches. UNMAP/TRIM is
a SCSI/ATA command which is only relevant to SSDs. It is a promise of
the OS to not read the given range of blocks any more, so the device
is free to discard the contents and return arbitrary data on the
next read. This helps the device to level out the number of times
the flash storage cells are overwritten (<em>wear-leveling</em>),
which improves the durability of the device. </p>
<p> The first task of the block layer is to split incoming bios if
necessary to make them conform to the size limit or the alignment
requirements of the target device, and to batch and merge bios so that
they can be submitted as a unit for performance reasons. The bios
processed in this way then form an I/O request which is handed to an <em>
I/O scheduler </em> (also known as <em> elevator</em>). </p>
<p> At the time of writing (2018-11) there exist two different sets
of schedulers: the traditional single-queue schedulers and the
modern multi-queue schedulers, which are expected to replace the
single-queue schedulers soon. The three single-queue schedulers,
noop, deadline and cfq (completely fair queueing), were designed for
rotating disks. They reorder requests with the aim of minimizing seek
time. The newer multi-queue schedulers, mq-deadline, kyber, and bfq
(budget fair queueing), aim to max out even the fastest devices. As
implied by the name "multi-queue", they implement several request
queues, the number of which depends on the hardware in use. This
has become necessary because modern storage hardware allows multiple
requests to be submitted in parallel from different CPUs. Moreover,
with many CPUs the locking overhead required to put a request into
a queue increases. Per-CPU queues allow for per-CPU locks, which
decreases queue lock contention. </p>
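<p> As a small illustration, the following sketch extracts the active
scheduler from the contents of the <code>queue/scheduler</code> file
in sysfs, where the kernel encloses the active scheduler in square
brackets. The function name <code>active_scheduler</code> and the
device name <code>sda</code> are assumptions made for this example. </p>

```shell
# Print the scheduler which is enclosed in square brackets in a line
# like "mq-deadline kyber [bfq] none". Prints nothing if there is no
# bracketed entry.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Example with a literal string (no access to sysfs required).
active_scheduler "noop deadline [cfq]"

# On a real system (assuming the device is called sda):
#   active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
```

<p> Writing one of the listed names to the same sysfs file selects
that scheduler for the device. </p>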
<p> We will take a look at some aspects of the Linux block layer and at
the various I/O schedulers. An exercise on loop devices enables the
reader to create block devices for testing. This will be handy in
the subsequent sections on LVM-specific topics. </p>
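<p> The loop device setup used for testing might look roughly like the
following sketch. The helper <code>run</code>, the function name and
all paths are hypothetical. By default the commands are only printed,
so the sketch can be tried without root privileges. Set
<code>DRY_RUN=0</code> to actually execute them. </p>

```shell
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create a sparse 1G file, attach it to the given loop device, create
# an XFS file system on the device and mount it.
setup_loop_xfs() {
    file=$1 dev=$2 mnt=$3
    run truncate -s 1G "$file"
    run losetup "$dev" "$file"
    run mkfs.xfs "$dev"
    run mount "$dev" "$mnt"
}

# All three arguments are placeholders.
setup_loop_xfs /tmp/loop.img /dev/loop0 /mnt
```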
<ul>

<li> Run <code>find /dev -type b</code> to get the list of all block
devices on your system. Explain which is which. </li>

<li> Examine the files in <code>/sys/block/sda</code>, in
particular <code>/sys/block/sda/stat</code>. Search the web for
<code>Documentation/block/stat.txt</code> for the meaning of the
numbers shown. Then run <code>iostat -xdh sda 1</code>. </li>

<li> Examine the files in <code>/sys/block/sda/queue</code>. </li>

<li> Find out how to determine the size of a block device. </li>

<li> Figure out a way to identify the names of all block devices which
correspond to SSDs (i.e., excluding any rotating disks). </li>

<li> Run <code>lsblk</code> and discuss
the output. Too easy? Run <code>lsblk -o
</code> </li>

<li> What's the difference between a task scheduler and an I/O
scheduler? </li>

<li> Why are I/O schedulers also called elevators? </li>

<li> How can one find out which I/O schedulers are supported on a
system and which scheduler is active for a given block device? </li>

<li> Is it possible (and safe) to change the I/O scheduler for a
block device while it is in use? If so, how can this be done? </li>

<li> The loop device driver of the Linux kernel allows privileged
users to create a block device from a regular file stored on a file
system. The resulting block device is called a <em>loop</em> device.
Create a 1G large temporary file containing only zeroes. Run a suitable
<code>losetup(8)</code> command to create a loop device from the
file. Create an XFS file system on the loop device and mount it. </li>

</ul>
<ul>
<li> Come up with three different use cases for loop devices. </li>

<li> Given a block device node in <code> /dev</code>, how can one
tell that it is a loop device? </li>

<li> Describe the connection between loop devices created by
<code>losetup(8)</code> and the loopback device used for network
connections from the machine to itself. </li>

</ul>
»)
define(«svg_disk», «
	<g
		fill="$5"
		stroke="black"
		stroke-width="1"
	>
		<ellipse
			cx="eval($1 + $3 / 2)"
			cy="eval($2 + $4)"
			rx="eval($3 / 2)"
			ry="eval($3 / 4)"
		/>
		<rect
			x="$1"
			y="$2"
			width="$3"
			height="$4"
		/>
		<rect
			x="eval($1 + 1)"
			y="eval($2 + $4 - 1)"
			width="eval($3 - 2)"
			height="2"
			stroke="$5"
		/>
		<ellipse
			cx="eval($1 + $3 / 2)"
			cy="$2"
			rx="eval($3 / 2)"
			ry="eval($3 / 4)"
		/>
	</g>
»)
SECTION(«Physical and Logical Volumes, Volume Groups»)

<p> Getting started with the Logical Volume Manager (LVM) requires
getting used to a small vocabulary. This section introduces
the words named in the title of the section, and a couple more.
The basic concepts of LVM are then described in terms of these words. </p>
<div>
define(«lvm_width», «300»)
define(«lvm_height», «183»)
define(«lvm_margin», «10»)
define(«lvm_extent_size», «10»)
define(«lvm_extent», «
	<rect
		fill="$1"
		x="$2"
		y="$3"
		width="lvm_extent_size()"
		height="lvm_extent_size()"
		stroke="black"
		stroke-width="1"
	/>
»)
dnl $1: color, $2: x, $3: y, $4: number of extents
define(«lvm_extents», «
	ifelse(«$4», «0», «», «
		lvm_extent(«$1», «$2», «$3»)
		lvm_extents(«$1», eval($2 + lvm_extent_size() + lvm_margin()),
			«$3», eval($4 - 1))
	»)
»)
dnl $1: x, $2: y, $3: number of extents, $4: disk color, $5: extent color
define(«lvm_disk», «
	ifelse(eval(«$3» > 3), «1», «
		pushdef(«h», «eval(7 * lvm_extent_size())»)
		pushdef(«w», «eval(($3 + 1) * lvm_extent_size())»)
	», «
		pushdef(«h», «eval(3 * lvm_extent_size() + lvm_margin())»)
		pushdef(«w», «eval($3 * lvm_extent_size() * 2)»)
	»)
	svg_disk(«$1», «$2», «w()», «h()», «$4»)
	ifelse(eval(«$3» > 3), «1», «
		pushdef(«n1», eval(«$3» / 2))
		pushdef(«n2», eval(«$3» - n1()))
		lvm_extents(«$5»,
			eval(«$1» + (w() - (2 * n1() - 1) * lvm_extent_size()) / 2),
			eval(«$2» + h() / 2 - lvm_extent_size()), «n1()»)
		lvm_extents(«$5»,
			eval(«$1» + (w() - (2 * n2() - 1) * lvm_extent_size()) / 2),
			eval(«$2» + h() / 2 + 2 * lvm_extent_size()), «n2()»)
		popdef(«n1»)
		popdef(«n2»)
	», «
		lvm_extents(«$5»,
			eval(«$1» + (w() - (2 * «$3» - 1) * lvm_extent_size()) / 2),
			eval(«$2» + h() / 2), «$3»)
	»)
	popdef(«w»)
	popdef(«h»)
»)
<svg
	width="lvm_width()" height="lvm_height()"
	xmlns="http://www.w3.org/2000/svg"
	xmlns:xlink="http://www.w3.org/1999/xlink"
>
	<rect
		x="1"
		y="1"
		width="140"
		height="180"
		fill="green"
		rx="10"
		stroke-width="1"
		stroke="black"
	/>
	lvm_disk(«20», «20», «2», «#666», «yellow»)
	lvm_disk(«10», «90», «4», «#666», «yellow»)
	lvm_disk(«70», «55», «5», «#666», «yellow»)
	<path
		d="
			M 155 91
			l 20 0
			m 0 0
			l -4 -3
			l 0 6
			l 4 -3
			z
		"
		stroke-width="4"
		stroke="black"
		fill="black"
	/>
	lvm_disk(«190», «22», «7», «#66f», «orange»)
	lvm_disk(«220», «130», «1», «#66f», «orange»)
</svg>
</div>
<p> A <em> Physical Volume</em> (PV, grey) is an arbitrary block device which
contains a certain metadata header (also known as <em>superblock</em>)
at the start. PVs can be partitions on a local hard disk or an SSD,
a software or hardware raid, or a loop device. LVM does not care.
The storage space on a physical volume is managed in units called <em>
Physical Extents </em> (PEs, yellow). The default PE size is 4M. </p>
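<p> To get a feeling for the numbers, the following sketch computes how
many extents a volume of a given size occupies. It assumes that sizes
are rounded up to a whole number of extents. The function name
<code>extents_for</code> is made up for this example, and all sizes
are given in MiB. </p>

```shell
# Print the number of extents needed for a volume of $1 MiB, given an
# extent size of $2 MiB (default: 4). The result is rounded up.
extents_for() {
    size_mib=$1
    pe_mib=${2:-4}
    echo $(( ($size_mib + $pe_mib - 1) / $pe_mib ))
}

extents_for 3072      # a 3G volume with the default 4M extent size
extents_for 10 4      # 10M do not fit into two extents
```

<p> With the default extent size, a 3G volume thus occupies 768
extents. </p>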
<p> A <em>Volume Group</em> (VG, green) is a non-empty set of PVs with
a name and a unique ID assigned to it. A PV can but doesn't need to
be assigned to a VG. If it is, the ID of the associated VG is stored
in the metadata header of the PV. </p>

<p> A <em> Logical Volume</em> (LV, blue) is a named block device which is
provided by LVM. LVs are always associated with a VG and are stored
on that VG's PVs. Since LVs are normal block devices, file systems
of any type can be created on them, they can be used as swap storage,
etc. The chunks of an LV are managed as <em>Logical Extents</em> (LEs,
orange). Often the LE size equals the PE size. For each LV there is
a mapping between the LEs of the LV and the PEs of the underlying
PVs. The PEs can span multiple PVs. </p>
<p> A VG can be extended by adding PVs to it, or reduced by
removing unused devices, i.e., those with no PEs allocated on them. PEs
may be moved from one PV to another while the LVs are active. LVs
may be grown or shrunk. To grow an LV, there must be enough space
left in the VG. Growing an LV does not magically grow the file system
stored on it, however. To make use of the additional space, a second,
file system specific step is needed to tell the file system that its
underlying block device (the LV) has grown. </p>
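<p> For an XFS file system, the two steps might look like the following
sketch. The VG, LV and mount point names are placeholders, and the
<code>run</code> helper is hypothetical: it only prints the commands
unless <code>DRY_RUN=0</code> is set. </p>

```shell
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: grow the LV. Step 2: tell the mounted XFS file system to
# make use of the additional space.
grow_xfs_lv() {
    lv=$1 add=$2 mnt=$3
    run lvextend -L "+$add" "$lv"
    run xfs_growfs "$mnt"
}

# Placeholder names: LV tlv1 in VG tvg, mounted on /mnt.
grow_xfs_lv tvg/tlv1 1G /mnt
```

<p> Other file systems come with their own tools for the second
step. </p>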
<p> The exercises of this section illustrate the basic LVM concepts
and the essential LVM commands. They ask the reader to create a VG
whose PVs are loop devices. This VG is used as a starting point in
subsequent chapters. </p>

<ul>

<li> Create two 5G large loop devices <code>/dev/loop1</code>
and <code>/dev/loop2</code>. Make them PVs by running
<code>pvcreate</code>. Create a VG <code>tvg</code> (test volume group)
from the two loop devices and two 3G large LVs named <code>tlv1</code>
and <code>tlv2</code> on it. Run the <code>pvcreate, vgcreate</code>,
and <code>lvcreate</code> commands with <code>-v</code> to activate
verbose output and try to understand each output line. </li>

<li> Run <code>pvs, vgs, lvs, lvdisplay, pvdisplay</code> and examine
the output. </li>

<li> Run <code>lvdisplay -m</code> to examine the mapping of logical
extents to PVs and physical extents. </li>

<li> Run <code>pvs --segments -o+lv_name,seg_start_pe,segtype</code>
to see the map between physical extents and logical extents. </li>

</ul>

In the above scenario (two LVs in a VG consisting of two PVs), how
can you tell whether both PVs are actually used? Remove the LVs
with <code>lvremove</code>. Recreate them, but this time use the
<code>--stripes 2</code> option to <code>lvcreate</code>. Explain
what this option does and confirm with a suitable command.

»)
SECTION(«Device Mapper and Device Mapper Targets»)

<p> The kernel part of the Logical Volume Manager (LVM) is called
<em>device mapper</em> (DM), which is a generic framework to map
one block device to another. Applications talk to the Device Mapper
via the <em>libdevmapper</em> library, which issues requests
to the <code>/dev/mapper/control</code> character device using the
<code>ioctl(2)</code> system call. The device mapper is also accessible
from scripts via the <code>dmsetup(8)</code> tool. </p>

<p> A DM target represents one particular mapping type for ranges
of LEs. Several DM targets exist, each of which creates and
maintains block devices with certain characteristics. In this section
we take a look at the <code>dmsetup</code> tool and the relatively
simple <em>mirror</em> target. Subsequent sections cover other targets
in more detail. </p>
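<p> To give an idea of what a mapping looks like at the
<code>dmsetup</code> level, the sketch below builds a table line for
the <em>linear</em> target, which is not covered in this section but
is about the simplest target there is. A table line consists of the
start sector, the length in 512-byte sectors, the target name and the
target arguments. The device name, the mapping name
<code>example</code> and the helper functions are assumptions, and the
<code>dmsetup</code> command is only printed unless
<code>DRY_RUN=0</code> is set. </p>

```shell
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Print a table line which maps the first $2 sectors of device $1
# one-to-one, starting at offset 0 of that device.
linear_table() {
    dev=$1 sectors=$2
    echo "0 $sectors linear $dev 0"
}

# 2048 sectors of 512 bytes each correspond to 1M.
run dmsetup create example --table "$(linear_table /dev/loop1 2048)"
```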
<ul>

<li> Run <code>dmsetup targets</code> to list all targets supported
by the currently running kernel. Explain their purpose and typical
use cases. </li>

<li> Starting with the <code>tvg</code> VG, remove <code>tlv2</code>.
Convince yourself by running <code>vgs</code> that <code>tvg</code>
is 10G large, with 3G being in use. Run <code>pvmove
/dev/loop1</code> to move the used PEs of <code>/dev/loop1</code>
to <code>/dev/loop2</code>. After the command completes, run
<code>pvs</code> again to see that <code>/dev/loop1</code> has no
more PEs in use. </li>

<li> Create a third 5G loop device <code>/dev/loop3</code>, make it a
PV and extend the VG with <code>vgextend tvg /dev/loop3</code>. Remove
<code>tlv1</code>. Now the LEs of <code>tlv2</code> fit on any
of the three PVs. Come up with a command which moves them to
<code>/dev/loop3</code>. </li>

<li> The first two loop devices are both unused. Remove them from
the VG with <code>vgreduce -a</code>. Why are they still listed in
the <code>pvs</code> output? What can be done about that? </li>

</ul>

As advertised in the introduction, LVM allows the administrator to
replace the underlying storage of a file system online. This is done
by running a suitable <code>pvmove(8)</code> command to move all PEs of
one PV to different PVs in the same VG.

<ul>

<li> Explain the mapping type of dm-mirror. </li>

<li> The traditional way to mirror the contents of two or more block
devices is software raid 1, also known as <em>md raid1</em> ("md"
is short for multi-disk). Explain the difference between md raid1,
the dm-raid target which supports raid1 and other raid levels, and
the dm-mirror target. </li>

<li> Guess how <code>pvmove</code> is implemented on top of
dm-mirror. Verify your guess by reading the "NOTES" section of the
<code>pvmove(8)</code> man page. </li>

</ul>
»)
SECTION(«LVM Snapshots»)

<p> LVM snapshots are based on the CoW optimization
strategy described earlier in the chapter on <a
href="./Unix_Concepts.html#the_virtual_address_space_of_a_unix_process">Unix
Concepts</a>. Creating a snapshot means creating a CoW table of
the given size. Just before an LE of a snapshotted LV is about to be
written to, its contents are copied to a free slot in the CoW
table. This preserves an old version of the LV, the snapshot, which
can later be reconstructed by overlaying the CoW table atop the LV. </p>

<p> Snapshots can be taken from an LV which contains a mounted file system,
while applications are actively modifying files. Without coordination
between the file system and LVM, the file system most likely has memory
buffers scheduled for writeback. These outstanding writes did not make
it to the snapshot, so one cannot expect the snapshot to contain a
consistent file system image. Instead, it is in a state similar to that
of a regular device after an unclean shutdown. This is not a problem for
XFS and EXT4, as both are <em>journalling</em> file systems, which
were designed with crash recovery in mind. At the next mount after a
crash, journalling file systems replay their journal, which results
in a consistent state. Note that this implies that even a read-only
mount of the snapshot device has to write to the device. </p>
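<p> The bookkeeping behind the CoW table can be illustrated with a toy
model. The sketch below is not how the kernel implements snapshots. It
merely records which extents of the origin LV have already been
preserved, under the assumption that each extent needs to be copied
at most once. All names are made up for this example. </p>

```shell
cow=""        # extents whose old contents are in the CoW table
cow_used=0    # number of occupied slots in the CoW table

# Simulate a write to extent $1 of the origin LV. The old contents
# are copied to the CoW table on the first write only.
write_extent() {
    case " $cow " in
        *" $1 "*) ;;    # old contents have already been preserved
        *) cow="$cow $1"; cow_used=$((cow_used + 1)) ;;
    esac
}

# The first round of writes fills three slots of the CoW table.
for le in 1 2 3; do write_extent "$le"; done
echo "$cow_used"

# Overwriting the same extents does not consume additional slots.
for le in 1 2 3; do write_extent "$le"; done
echo "$cow_used"
```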
<ul>

<li> In the test VG, create a 1G large snapshot named
<code>snap_tlv1</code> of the <code>tlv1</code> LV by using the
<code>-s</code> option to <code>lvcreate(8)</code>. Predict how much
free space is left in the VG. Confirm with <code>vgs tvg</code>. </li>

<li> Create an EXT4 file system on <code>tlv1</code> by running
<code>mkfs.ext4 /dev/tvg/tlv1</code>. Guess how much of the snapshot
space has been allocated by this operation. Check with <code>lvs
tvg/snap_tlv1</code>. </li>

<li> Remove the snapshot with <code>lvremove</code> and recreate
it. Repeat the previous step, but this time run <code>mkfs.xfs</code>
to create an XFS file system. Run <code>lvs tvg/snap_tlv1</code>
again and compare the used snapshot space to the EXT4 case. Explain
the difference. </li>

<li> Remove the snapshot and recreate it so that both <code>tlv1</code>
and <code>snap_tlv1</code> contain a valid XFS file system. Mount
the file systems on <code>/mnt/1</code> and <code>/mnt/2</code>. </li>

<li> Run <code>dd if=/dev/zero of=/mnt/1/zero count=$((2 * 100 *
1024))</code> to create a 100M large file on <code>tlv1</code>. Check
that <code>/mnt/2</code> is still empty. Estimate how much of the
snapshot space is used and check again. </li>

<li> Repeat the above <code>dd</code> command 5 times and run
<code>lvs</code> again. Explain why the used snapshot space did not
increase. </li>

<li> It is possible to create snapshots of snapshots. This is
implemented by chaining together CoW tables. Describe the impact on
performance. </li>

<li> Suppose a snapshot was created before significant modifications
were made to the contents of the LV, for example an upgrade of a large
software package. Assume that the user wishes to permanently return to
the old version because the upgrade did not work out. In this scenario
it is the snapshot which needs to be retained, rather than the original
LV. In view of this scenario, guess what happens on the attempt to
remove an LV which is being snapshotted. Unmount <code>/mnt/1</code>
and confirm by running <code>lvremove tvg/tlv1</code>. </li>

<li> Come up with a suitable <code>lvconvert</code> command which
replaces the role of the LV and its snapshot. Explain why this solves
the "bad upgrade" problem outlined above. </li>

<li> Explain what happens if the CoW table fills up. Confirm by
writing a file larger than the snapshot size. </li>

</ul>
SECTION(«Thin Provisioning»)

<p> The term "thin provisioning" is just a modern buzzword for
over-subscription. Both terms mean to give the appearance of having
more resources than are actually available. This is achieved by
on-demand allocation. Linux implements thin provisioning
as a DM target called <em>dm-thin</em>. This code
first made its appearance in 2011 and was declared stable two
years later. These days it should be safe for production use. </p>

<p> The general problem with thin provisioning is of course that bad
things happen when the resources are exhausted because the demand has
increased before new resources were added. For dm-thin this can happen
when users write to their allotted space, causing dm-thin to attempt
allocating a data block from a volume which is already full. This
usually leads to severe data corruption because file systems are
not really prepared to handle this error case and treat it as if the
underlying block device had failed. dm-thin does nothing to prevent
this, but one can configure a <em>low watermark</em>. When the
number of free data blocks drops below the watermark, a so-called
<em>dm-event</em> will be generated to notify the administrator. </p>
<p> One highlight of dm-thin is its efficient support for an arbitrary
depth of recursive snapshots, called <em>dm-thin snapshots</em>
in this document. With the traditional snapshot implementation,
recursive snapshots quickly become a performance issue as the depth
increases. With dm-thin one can have an arbitrary subset of all
snapshots active at any point in time, and there is no ordering
requirement on activating or removing them. </p>

<p> The block devices created by dm-thin always belong to a <em>thin
pool</em> which ties together two LVs called the <em>metadata LV</em>
and the <em>data LV</em>. The combined LV is called the <em>thin pool
LV</em>. Setting up a VG for thin provisioning is done in two steps:
First the standard LVs for the data and the metadata are created. Second,
the two LVs are combined into a thin pool LV. The second step hides
the two underlying LVs so that only the combined thin pool LV is
visible afterwards. Thin provisioned LVs and dm-thin snapshots can
then be created from the thin pool LV with a single command. </p>
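<p> The steps described above might translate into commands along the
lines of the following sketch, which is modelled on the usage
described in <code>lvmthin(7)</code>. The LV names and sizes are
examples only, and the <code>run</code> helper is hypothetical: it
prints the commands unless <code>DRY_RUN=0</code> is set. </p>

```shell
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Set up a thin pool in VG $1 and create one over-subscribed thin LV
# plus one dm-thin snapshot of it.
make_thin_pool() {
    vg=$1
    # Step 1: create standard LVs for the data and the metadata.
    run lvcreate -L 5G -n tdlv "$vg"
    run lvcreate -L 500M -n tmdlv "$vg"
    # Step 2: combine the two LVs into a thin pool LV.
    run lvconvert --type thin-pool --poolmetadata "$vg/tmdlv" "$vg/tdlv"
    # A thin LV whose virtual size exceeds the size of the pool.
    run lvcreate -V 10G --thinpool tdlv -n oslv "$vg"
    # A dm-thin snapshot of the thin LV.
    run lvcreate -s -n snap_oslv "$vg/oslv"
}

make_thin_pool tvg
```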
<p> Another nice feature of dm-thin is its support for <em>external
snapshots</em>.
An external snapshot is one where the origin for a thinly provisioned
device is not a device of the pool. Arbitrary read-only block
devices can be turned into writable devices by creating an external
snapshot. Reads to an unprovisioned area of the snapshot will be passed
through to the origin. Writes trigger the allocation of new blocks as
usual with CoW. One use case for this is VM hosts which run their VMs
on thinly-provisioned volumes but have the base image on some "master"
device which is read-only and can hence be shared between all VMs. </p>

<p> Starting with the <code>tvg</code> VG, create and test a thin pool LV
by performing the following steps. The "Thin Usage" section of
<code>lvmthin(7)</code> will be helpful.
<ul>

<li> Remove the <code>tlv1</code> and <code>tlv2</code> LVs. </li>

<li> Create a 5G data LV named <code>tdlv</code> (thin data LV)
and a 500M LV named <code>tmdlv</code> (thin metadata LV). </li>

<li> Combine the two LVs into a thin pool with
<code>lvconvert</code>. Run <code>lvs -a</code> and explain the flags
listed below <code>Attr</code>. </li>

<li> Create a 10G thin LV named <code>oslv</code> (over-subscribed
LV). </li>

<li> Create an XFS file system on <code>oslv</code> and mount it on
<code>/mnt</code>. </li>

<li> Run a loop of the form <code>for ((i = 0; i &lt; 50; i++)); do
... ; done</code> so that each iteration creates a 50M file named
<code>file-$i</code> and a snapshot named <code>snap_oslv-$i</code>
of <code>oslv</code>. </li>

<li> Activate an arbitrary snapshot with <code>lvchange -K</code> and
try to mount it. Explain what the error message means. Then read the
"XFS on snapshots" section of <code>lvmthin(7)</code>. </li>

<li> Check the available space of the data LV with <code>lvs
-a</code>. Mount one snapshot (specifying <code>-o nouuid</code>)
and run <code>lvs -a</code> again. Why did the free space decrease
although no new files were written? </li>

<li> Mount four different snapshots and check that they contain the
expected files. </li>

<li> Remove all snapshots. Guess what <code>lvs -a</code> and <code>df
-h /mnt</code> report. Then run the commands to confirm. Guess
what happens if you try to create another 3G file. Confirm
your guess, then read the section on "Data space exhaustion" of
<code>lvmthin(7)</code>. </li>

</ul>

When a thin pool provisions a new data block for a thin LV, the new
block is first overwritten with zeros by default. Discuss why this
is done, its impact on performance and security, and conclude whether
or not it is a good idea to turn off the zeroing.

»)
SECTION(«Bcache, dm-cache and dm-writecache»)

<p> All three implementations named in the title of this section are <em>
Linux block layer caches</em>. They combine two different block
devices to form a hybrid block device which dynamically caches
and migrates data between the two devices with the aim of improving
performance. One device, the <em> backing device</em>, is expected
to be large and slow while the other one, the <em>cache device</em>,
is expected to be small and fast. </p>
<div>
define(«bch_width», «300»)
define(«bch_height», «130»)
define(«bch_margin», «10»)
define(«bch_rraid_width», «eval((bch_width() - 4 * bch_margin()) * 4 / 5)»)
define(«bch_raidbox_height», «eval(bch_height() - 2 * bch_margin())»)
define(«bch_nraid_width», «eval(bch_rraid_width() / 4)»)
define(«bch_rdisk_width», «eval((bch_width() - 3 * bch_margin()) * 18 / 100)»)
define(«bch_rdisk_height», «eval((bch_height() - 4 * bch_margin()) / 3)»)
define(«bch_ndisk_width», «eval(bch_rdisk_width() / 2)»)
define(«bch_ndisk_height», «eval(bch_raidbox_height() - 5 * bch_margin())»)
define(«bch_rdisk», «svg_disk(«$1», «$2»,
	«bch_rdisk_width()», «bch_rdisk_height()», «#666»)»)
define(«bch_ndisk», «svg_disk(«$1», «$2»,
	«bch_ndisk_width()», «bch_ndisk_height()», «#66f»)»)
define(«bch_5rdisk», «
	bch_rdisk(«$1», «$2»)
	bch_rdisk(«eval($1 + bch_margin())»,
		«eval($2 + bch_margin())»)
	bch_rdisk(«eval($1 + 2 * bch_margin())»,
		«eval($2 + 2 * bch_margin())»)
	bch_rdisk(«eval($1 + 3 * bch_margin())»,
		«eval($2 + 3 * bch_margin())»)
	bch_rdisk(«eval($1 + 4 * bch_margin())»,
		«eval($2 + 4 * bch_margin())»)

»)
define(«bch_rraid», «
	<rect
		fill="#3b3"
		stroke="black"
		x="$1"
		y="$2"
		width="bch_rraid_width()"
		height="bch_raidbox_height()"
		rx="10"
	/>
	bch_5rdisk(«eval($1 + bch_margin())»,
		«eval($2 + 2 * bch_margin())»)
	bch_5rdisk(«eval($1 + 2 * bch_rdisk_width() + bch_margin())»,
		«eval($2 + 2 * bch_margin())»)
»)
define(«bch_nraid», «
	<rect
		fill="orange"
		stroke="black"
		x="$1"
		y="$2"
		width="bch_nraid_width()"
		height="bch_raidbox_height()"
		rx="10"
	/>
	bch_ndisk(eval($1 + bch_margin()),
		eval($2 + 2 * bch_margin()))
	bch_ndisk(eval($1 + 2 * bch_margin()),
		eval($2 + 3 * bch_margin()))
»)

<svg
	width="bch_width()" height="bch_height()"
	xmlns="http://www.w3.org/2000/svg"
	xmlns:xlink="http://www.w3.org/1999/xlink"
>
	<rect
		fill="#cc2"
		stroke="black"
		stroke-width="1"
		x="1"
		y="1"
		width="eval(bch_rraid_width() + bch_nraid_width()
			+ 3 * bch_margin() - 2)"
		height="eval(bch_raidbox_height() + 2 * bch_margin() - 2)"
		rx="10"
	/>
	bch_nraid(«bch_margin()», «bch_margin()»)
	bch_rraid(«eval(2 * bch_margin() + bch_nraid_width())», «bch_margin()»)
</svg>
</div>
<p> The simplest setup consists of a single rotating disk and one SSD.
The setup shown in the diagram at the left is realistic for a large
server with redundant storage. In this setup the hybrid device
(yellow) combines a raid6 array (green) consisting of many rotating
disks (grey) with a two-disk raid1 array (orange) stored on fast
NVMe devices (blue). In the simple setup it is always a win when
I/O is performed from/to the SSD instead of the rotating disk. In
the server setup, however, it depends on the workload which device
is faster. Given enough rotating disks and a streaming I/O workload,
the raid6 outperforms the raid1 because all disks can read or write
at full speed. </p>

<p> Since block layer caches hook into the Linux block API described <a
href="«#»the_linux_block_layer">earlier</a>, the hybrid block devices
they provide can be used like any other block device. In particular,
the hybrid devices are <em> file system agnostic</em>, meaning that
any file system can be created on them. In what follows we briefly
describe the differences between the three block layer caches and
conclude with the pros and cons of each. </p>
<p> Bcache is a stand-alone stacking device driver which was
included in the Linux kernel in 2013. According to the <a
href="https://bcache.evilpiepirate.org/">bcache home page</a>, it
is "done and stable". dm-cache and dm-writecache are device mapper
targets included in 2013 and 2018, respectively, which are both marked
as experimental. In contrast to dm-cache, dm-writecache only caches
writes while reads are supposed to be cached in RAM. It has been
designed for programs like databases which need low commit latency.
Both bcache and dm-cache can operate in writeback or writethrough
mode while dm-writecache always operates in writeback mode. </p>

<p> The DM-based caches are designed to leave the decision as to what
data to migrate (and when) to user space while bcache has this policy
built-in. However, at this point only the <em> Stochastic Multiqueue
</em> (smq) policy for dm-cache exists, plus a second policy which
is only useful for decommissioning the cache device. There are no
tunables for dm-cache while all the bells and whistles of bcache can
be configured through sysfs files. Another difference is that bcache
detects sequential I/O and separates it from random I/O so that large
streaming reads and writes bypass the cache and don't push cached
randomly accessed data out of the cache. </p>
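<p> As an example of the sysfs tunables of bcache, the following sketch
selects writeback mode and sets the threshold above which sequential
I/O bypasses the cache. It assumes that the bcache device is called
<code>bcache0</code> and that the files <code>cache_mode</code>
and <code>sequential_cutoff</code> are available as described in the
bcache documentation of the kernel. The <code>run</code> helper is
hypothetical and only prints the commands unless
<code>DRY_RUN=0</code> is set. </p>

```shell
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Configure the bcache device $1 (for example bcache0).
bcache_tune() {
    dev=$1
    # Cache writes as well as reads.
    run sh -c "echo writeback > /sys/block/$dev/bcache/cache_mode"
    # Sequential I/O beyond this size bypasses the cache.
    run sh -c "echo 4M > /sys/block/$dev/bcache/sequential_cutoff"
}

bcache_tune bcache0
```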
<p> bcache is the clear winner of this comparison because it is stable,
configurable and performs better, at least on the server setup
described above, because it separates random and sequential I/O. The
only advantage of dm-cache is its flexibility because cache policies
can be switched. But even this remains a theoretical advantage as
long as only a single policy for dm-cache exists. </p>
737 <ul>
739 <li> Recall the concepts of writeback and writethrough and explain
740 why writeback is faster and writethrough is safer. </li>
742 <li> Explain how the <em>writearound</em> mode of bcache works and
743 when it should be used. </li>
745 <li> Setup a bcache device from two loop devices. </li>
747 <li> Create a file system of a bcache device and mount it. Detach
748 the cache device while the file system is mounted. </li>
750 <li> Setup a dm-cache device from two loop devices. </li>
752 <li> Setup a thin pool where the data LV is a dm-cache device.</li>
754 <li> Explain the point of dm-cache's <em>passthrough</em> mode.</li>
756 </ul>
Explain why small writes to a file system which is stored on a
parity raid result in read-modify-write (RMW) updates. Explain why
RMW updates are particularly expensive and how raid implementations
and block layer caches try to avoid them.
»)
Recall the concepts of writeback and writethrough. Describe what
each mode means for a hardware device and for a bcache/dm-cache
device. Explain why writeback is faster and writethrough is safer.
»)
TRIM and UNMAP are special commands in the ATA/SCSI command sets
which inform an SSD that certain data blocks are no longer in use,
allowing the SSD to re-use these blocks to increase performance and
to reduce wear. Subsequent reads from the trimmed data blocks will
not return any meaningful data. For example, the <code> mkfs </code>
commands send such a command to discard all blocks of the device.
Discuss the implications when <code> mkfs </code> is run on a device
provided by bcache or dm-cache.
»)
SECTION(«The dm-crypt Target»)
<p> This device mapper target provides encryption of arbitrary block
devices by employing the primitives of the crypto API of the Linux
kernel. This API provides a uniform interface to a large number of
cipher algorithms which have been implemented with performance and
security in mind. </p>
<p> The cipher algorithm of choice for the encryption of block devices
is the <em> Advanced Encryption Standard </em> (AES), also known
as <em> Rijndael</em>, named after the two Belgian cryptographers
Rijmen and Daemen who proposed the algorithm in 1999. AES is a <em>
symmetric block cipher</em>, that is, a transformation which operates
on fixed-length blocks and which is determined by a single key for both
encryption and decryption. The underlying algorithm is fairly simple,
which makes AES perform well in both hardware and software. Key setup
is fast and the memory requirements are low. Modern processors of all
major manufacturers include instructions to perform AES operations in
hardware, improving speed and security. </p>
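<p> Whether a given machine offers such instructions can be checked
from user space. The following sketch looks for the <code>aes</code>
CPU flag. It assumes that <code>/proc/cpuinfo</code> lists the flags
on a line starting with <code>flags</code> (x86) or
<code>Features</code> (ARM) and that <code>grep</code> supports the
<code>-w</code> option. The function name is made up for this
example. </p>

```shell
# Return success if the CPU advertises AES instructions.
# $1: file to inspect, /proc/cpuinfo by default.
cpu_has_aes() {
	# Select the flag lines, then look for "aes" as a whole word so
	# that flags which merely contain these letters do not count.
	grep -E '^(flags|Features)[[:space:]]*:' "${1:-/proc/cpuinfo}" |
		grep -qw aes
}
```

<p> For example, <code>cpu_has_aes &amp;&amp; echo "AES instructions
available"</code> prints the message only on suitable hardware. </p>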
<p> According to the Snowden documents, the NSA has been doing research
on breaking AES for a long time without being able to come up with
a practical attack for 256 bit keys. Successful attacks invariably
target the key management software instead, which is often implemented
poorly, trading security for user-friendliness, for example by
storing passwords weakly encrypted, or by providing a "feature"
which can decrypt the device without knowing the password. </p>
<p> The exercises of this section ask the reader to encrypt a loop device
with AES without relying on any third party key management software. </p>
<ul>
<li> Discuss the message of this <a
href="https://xkcd.com/538/">xkcd</a> comic. </li>
<li> How can a hardware implementation of an algorithm like AES
improve security? After all, it is the same algorithm that is
implemented. </li>
<li> What's the point of the <a href="#random_stream">rstream.c</a>
program below which writes random data to stdout? Doesn't <code>
cat /dev/urandom </code> do the same? </li>
<li> Compile and run <a href="#random_stream">rstream.c</a> to create
a 10G local file and create the loop device <code> /dev/loop0 </code>
from the file. </li>
<li> A <em> table </em> for the <code> dmsetup(8) </code> command is
a single line of the form <code> start_sector num_sectors target_type
target_args</code>. Determine the correct values for the first three
arguments to encrypt <code> /dev/loop0</code>. </li>
<li> The <code> target_args </code> for the dm-crypt target are
of the form <code> cipher key iv_offset device offset</code>. To
encrypt <code> /dev/loop0 </code> with AES-256, <code> cipher </code>
is <code> aes</code>, <code> device </code> is <code> /dev/loop0
</code> and both offsets are zero. Come up with an idea to create a
256 bit key from a passphrase. </li>
<li> The <code> create </code> subcommand of <code> dmsetup(8)
</code> creates a device from the given table. Run a command of
the form <code> echo "$table" | dmsetup create cryptdev </code>
to create the encrypted device <code> /dev/mapper/cryptdev </code>
from the loop device. </li>
<li> Create a file system on <code> /dev/mapper/cryptdev</code>,
mount it and create the file <code> passphrase </code> containing
the string "super-secret" on this file system. </li>
<li> Unmount the <code> cryptdev </code> device and run <code> dmsetup
remove cryptdev</code>. Run <code> strings </code> on the loop device
and on the underlying file to see whether they contain the string
<code> super-secret </code> or <code> passphrase</code>. </li>
<li> Re-create the <code> cryptdev </code> device, but this time use
a different (hence invalid) key. Guess what happens and confirm. </li>
<li> Write a script which disables echoing (<code>stty -echo</code>),
reads a passphrase from stdin and combines the above steps to create
and mount an encrypted device. </li>
</ul>
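<p> One possible way to approach the table-related exercises above
is sketched below. It is not the only solution. The function names
are made up for this example, and hashing the passphrase once with
SHA-256 is an assumption chosen for simplicity: it yields a 256 bit
key in the hex encoding which dm-crypt expects, but offers little
protection against guessing a weak passphrase. </p>

```shell
# Derive a hex-encoded 256 bit key from a passphrase.
# Assumption: a single SHA-256 hash serves as the key derivation.
passphrase_to_key() {
	printf '%s' "$1" | sha256sum | cut -d ' ' -f 1
}

# Print a dm-crypt table which maps a whole device, starting at sector 0.
# $1: device, $2: device size in 512-byte sectors, $3: passphrase
crypt_table() {
	printf '0 %s crypt aes %s 0 %s 0\n' \
		"$2" "$(passphrase_to_key "$3")" "$1"
}
```

<p> The size in sectors of an existing device can be obtained with
<code>blockdev --getsz /dev/loop0</code>, so that running <code>
crypt_table /dev/loop0 "$(blockdev --getsz /dev/loop0)" "$passphrase"
| dmsetup create cryptdev </code> as root would create the device.
</p>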
Why is it a good idea to overwrite a block device with random data
before it is encrypted?
»)
The dm-crypt target encrypts whole block devices. An alternative is
to encrypt on the file system level. That is, each file is encrypted
separately. Discuss the pros and cons of both approaches.
»)
SUBSECTION(«Random stream»)
<pre>
<code>
/* Link with -lcrypto */
#include &lt;openssl/rand.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt; /* exit(), EXIT_FAILURE */
#include &lt;unistd.h&gt;

int main(int argc, char **argv)
{
	unsigned char buf[1024 * 1024];

	/* Write random data to stdout until an error occurs. */
	for (;;) {
		/* RAND_bytes() returns 1 on success. */
		int ret = RAND_bytes(buf, sizeof(buf));

		if (ret &lt;= 0) {
			fprintf(stderr, "RAND_bytes() error\n");
			exit(EXIT_FAILURE);
		}
		/* Short writes are harmless for a random stream. */
		ret = write(STDOUT_FILENO, buf, sizeof(buf));
		if (ret &lt; 0) {
			perror("write");
			exit(EXIT_FAILURE);
		}
	}
	return 0;
}
</code>
</pre>
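<p> Since the program above never terminates on its own, its output
has to be cut off at the desired size. The following sketch shows
one way to do this with <code>head(1)</code>. The function name and
the file names in the usage note are made up for this example, and
size suffixes such as <code>10G</code> assume the GNU version of
<code>head</code>. </p>

```shell
# Write exactly $2 bytes of the output of the command $1 to the file $3.
# head(1) exits after $2 bytes, which terminates the endless generator
# through SIGPIPE.
fill_file() {
	$1 | head -c "$2" > "$3"
}
```

<p> After compiling with <code>cc -o rstream rstream.c -lcrypto</code>,
the command <code>fill_file ./rstream 10G random.img</code> creates the
file, and <code>losetup /dev/loop0 random.img</code>, run as root,
creates the loop device from it. </p>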