3 Who the heck is General Failure, and why is he reading my disk? -- Unknown
5 », __file__)
The idea of Logical Volume Management is to decouple data and
storage. This offers great flexibility in managing storage and reduces
server downtime because the storage may be replaced while file
systems are mounted read-write and applications are actively using
them. This chapter provides an introduction to the Linux block layer
and LVM. Subsequent sections cover selected device mapper targets.
16 »)
18 SECTION(«The Linux Block Layer»)
<p> The main task of LVM is the management of block devices, so it is
natural to start an introduction to LVM with a section on the Linux
block layer, which is the central component in the Linux kernel
for the handling of persistent storage devices. The mission of the
block layer is to provide a uniform interface to different types
of storage devices. The obvious in-kernel users of this interface
are the file systems and the swap subsystem. <em> Stacking
device drivers </em> like LVM, Bcache and MD also access block devices
through this interface to create virtual block devices from other block
devices. Some user space programs (<code>fdisk, dd, mkfs, ...</code>)
need to access block devices as well. The block layer allows them to
perform their task in a well-defined and uniform manner through
block-special device files. </p>
<p> User space programs and the in-kernel users interact with the block
layer by sending read or write requests. A <em>bio</em> is the central
data structure that carries such requests within the kernel. Bios
may contain an arbitrary amount of data. They are given to the block
layer to be queued for subsequent handling. Often a bio has to travel
through a stack of block device drivers where each driver modifies
the bio and sends it on to the next driver. Typically, only the last
driver in the stack corresponds to a hardware device. </p>
<p> Besides requests to read or write data blocks, there are various other
bio requests which carry device commands like FLUSH, FUA (Force Unit
Access), TRIM and UNMAP. FLUSH and FUA ensure that certain data hits
stable storage. FLUSH asks the device to write out the contents of
its volatile write cache while a FUA request carries data that should
be written directly to the device, bypassing all caches. TRIM and UNMAP
are the ATA and SCSI variants of a command which is mainly relevant to
SSDs. It is a promise of
the OS to not read the given range of blocks any more, so the device
is free to discard the contents and return arbitrary data on the
next read. This helps the device to level out the number of times
the flash storage cells are overwritten (<em>wear-leveling</em>),
which improves the durability of the device. </p>
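<p> The difference between FLUSH and FUA can be illustrated with a toy
model of a device which has a volatile write cache. The Python sketch
below is not kernel code. The class <code>ToyDisk</code> and all of its
methods are made up for illustration, and the model ignores ordering
and partial writes. </p>

```python
class ToyDisk:
    """Toy model of a disk with a volatile write cache (hypothetical names)."""

    def __init__(self):
        self.cache = {}   # volatile write cache: lost on power failure
        self.stable = {}  # stable storage: survives power failure

    def write(self, block, data, fua=False):
        if fua:
            # FUA: the data bypasses the cache and goes to stable storage.
            self.cache.pop(block, None)
            self.stable[block] = data
        else:
            # Ordinary write: may linger in the volatile cache.
            self.cache[block] = data

    def flush(self):
        # FLUSH: write out the contents of the volatile cache.
        self.stable.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        # Whatever was only in the volatile cache is gone.
        self.cache.clear()


disk = ToyDisk()
disk.write(0, "plain")           # cached only
disk.write(1, "fua", fua=True)   # on stable storage right away
disk.power_loss()
print(disk.stable)               # only block 1 survived
```

<p> Calling <code>flush()</code> before the simulated power loss would
have preserved block 0 as well. </p>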
<p> The first task of the block layer is to split incoming bios if
necessary to make them conform to the size limit or the alignment
requirements of the target device, and to batch and merge bios so that
they can be submitted as a unit for performance reasons. The bios
processed in this way then form an I/O request which is handed to an <em>
I/O scheduler </em> (also known as <em> elevator</em>). </p>
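<p> Splitting and merging can be sketched in a few lines of Python.
This is a simplified model, not the kernel's algorithm: a request is
reduced to a pair (start sector, number of sectors), the names
<code>split_bio</code>, <code>merge_bios</code> and the parameter
<code>max_sectors</code> are hypothetical, and alignment requirements
are ignored. </p>

```python
def split_bio(bio, max_sectors):
    """Split one request into pieces of at most max_sectors sectors."""
    start, count = bio
    pieces = []
    while count > 0:
        n = min(count, max_sectors)
        pieces.append((start, n))
        start += n
        count -= n
    return pieces


def merge_bios(bios):
    """Merge requests which are adjacent on the device into larger ones."""
    merged = []
    for start, count in sorted(bios):
        if merged and merged[-1][0] + merged[-1][1] == start:
            # The request starts where the previous one ends: extend it.
            prev_start, prev_count = merged.pop()
            merged.append((prev_start, prev_count + count))
        else:
            merged.append((start, count))
    return merged


print(split_bio((0, 10), max_sectors=4))
print(merge_bios([(8, 8), (0, 8), (32, 8)]))
```

<p> A real implementation would also make sure that a merged request
does not exceed the size limit of the device. </p>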
<p> At the time of writing (2018-11) there exist two different sets
of schedulers: the traditional single-queue schedulers and the
modern multi-queue schedulers, which are expected to replace the
single-queue schedulers soon. The three single-queue schedulers,
noop, deadline and cfq (completely fair queueing), were designed for
rotating disks. They reorder requests with the aim of minimizing seek
time. The newer multi-queue schedulers, mq-deadline, kyber, and bfq
(budget fair queueing), aim to max out even the fastest devices. As
implied by the name "multi-queue", they implement several request
queues, the number of which depends on the hardware in use. This
has become necessary because modern storage hardware allows multiple
requests to be submitted in parallel from different CPUs. Moreover,
with many CPUs the locking overhead required to put a request into
a queue increases. Per-CPU queues allow for per-CPU locks, which
decreases queue lock contention. </p>
<p> We will take a look at some aspects of the Linux block layer and at
the various I/O schedulers. An exercise on loop devices enables the
reader to create block devices for testing. This will be handy in
the subsequent sections on LVM-specific topics. </p>
86 <ul>
88 <li> Run <code>find /dev -type b</code> to get the list of all block
89 devices on your system. Explain which is which. </li>
91 <li> Examine the files in <code>/sys/block/sda</code>, in
92 particular <code>/sys/block/sda/stat</code>. Search the web for
93 <code>Documentation/block/stat.txt</code> for the meaning of the
94 numbers shown. Then run <code>iostat -xdh sda 1</code>. </li>
96 <li> Examine the files in <code>/sys/block/sda/queue</code>. </li>
98 <li> Find out how to determine the size of a block device. </li>
<li> Figure out a way to identify the names of all block devices which
correspond to SSDs (i.e., excluding any rotating disks). </li>
103 <li> Run <code>lsblk</code> and discuss
104 the output. Too easy? Run <code>lsblk -o
106 </li>
108 <li> What's the difference between a task scheduler and an I/O
109 scheduler? </li>
111 <li> Why are I/O schedulers also called elevators? </li>
113 <li> How can one find out which I/O schedulers are supported on a
114 system and which scheduler is active for a given block device? </li>
116 <li> Is it possible (and safe) to change the I/O scheduler for a
117 block device while it is in use? If so, how can this be done? </li>
119 <li> The loop device driver of the Linux kernel allows privileged
120 users to create a block device from a regular file stored on a file
121 system. The resulting block device is called a <em>loop</em> device.
122 Create a 1G large temporary file containing only zeroes. Run a suitable
123 <code>losetup(8)</code> command to create a loop device from the
124 file. Create an XFS file system on the loop device and mount it. </li>
126 </ul>
130 <ul>
131 <li> Come up with three different use cases for loop devices. </li>
133 <li> Given a block device node in <code> /dev</code>, how can one
134 tell that it is a loop device? </li>
136 <li> Describe the connection between loop devices created by
137 <code>losetup(8)</code> and the loopback device used for network
138 connections from the machine to itself. </li>
140 </ul>
141 »)
dnl $1: x, $2: y, $3: width, $4: height, $5: fill color
define(«svg_disk», «
<g
fill="$5"
stroke="black"
stroke-width="1"
>
<ellipse
cx="eval($1 + $3 / 2)"
cy="eval($2 + $4)"
rx="eval($3 / 2)"
ry="eval($3 / 4)"
/>
<rect
x="$1"
y="$2"
width="$3"
height="$4"
/>
<rect
x="eval($1 + 1)"
y="eval($2 + $4 - 1)"
width="eval($3 - 2)"
height="2"
stroke="$5"
/>
<ellipse
cx="eval($1 + $3 / 2)"
cy="$2"
rx="eval($3 / 2)"
ry="eval($3 / 4)"
/>
</g>
»)
177 SECTION(«Physical and Logical Volumes, Volume Groups»)
<p> Getting started with the Logical Volume Manager (LVM) requires
getting used to a minimal set of vocabulary. This section introduces
the words named in the title of the section, and a couple more.
The basic concepts of LVM are then described in terms of these words. </p>
<div>
define(«lvm_width», «300»)
define(«lvm_height», «183»)
define(«lvm_margin», «10»)
define(«lvm_extent_size», «10»)
define(«lvm_extent», «
<rect
fill="$1"
x="$2"
y="$3"
width="lvm_extent_size()"
height="lvm_extent_size()"
stroke="black"
stroke-width="1"
/>
»)
dnl $1: color, $2: x, $3: y, $4: number of extents
define(«lvm_extents», «
ifelse(«$4», «0», «», «
lvm_extent(«$1», «$2», «$3»)
lvm_extents(«$1», eval($2 + lvm_extent_size() + lvm_margin()),
«$3», eval($4 - 1))
»)
»)
dnl $1: x, $2: y, $3: number of extents, $4: disk color, $5: extent color
define(«lvm_disk», «
ifelse(eval(«$3» > 3), «1», «
pushdef(«h», «eval(7 * lvm_extent_size())»)
pushdef(«w», «eval(($3 + 1) * lvm_extent_size())»)
», «
pushdef(«h», «eval(3 * lvm_extent_size() + lvm_margin())»)
pushdef(«w», «eval($3 * lvm_extent_size() * 2)»)
»)
svg_disk(«$1», «$2», «w()», «h()», «$4»)
ifelse(eval(«$3» > 3), «1», «
pushdef(«n1», eval(«$3» / 2))
pushdef(«n2», eval(«$3» - n1()))
lvm_extents(«$5»,
eval(«$1» + (w() - (2 * n1() - 1) * lvm_extent_size()) / 2),
eval(«$2» + h() / 2 - lvm_extent_size()), «n1()»)
lvm_extents(«$5»,
eval(«$1» + (w() - (2 * n2() - 1) * lvm_extent_size()) / 2),
eval(«$2» + h() / 2 + 2 * lvm_extent_size()), «n2()»)
popdef(«n1»)
popdef(«n2»)
», «
lvm_extents(«$5»,
eval(«$1» + (w() - (2 * «$3» - 1) * lvm_extent_size()) / 2),
eval(«$2» + h() / 2), «$3»)
»)
popdef(«w»)
popdef(«h»)
»)
<svg
width="lvm_width()" height="lvm_height()"
xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
>
<rect
x="1"
y="1"
width="140"
height="180"
fill="green"
rx="10"
stroke-width="1"
stroke="black"
/>
lvm_disk(«20», «20», «2», «#666», «yellow»)
lvm_disk(«10», «90», «4», «#666», «yellow»)
lvm_disk(«70», «55», «5», «#666», «yellow»)
<path
d="
M 155 91
l 20 0
m 0 0
l -4 -3
l 0 6
l 4 -3
z
"
stroke-width="4"
stroke="black"
fill="black"
/>
lvm_disk(«190», «22», «7», «#66f», «orange»)
lvm_disk(«220», «130», «1», «#66f», «orange»)
</svg>
</div>
<p> A <em> Physical Volume</em> (PV, grey) is an arbitrary block device which
contains a certain metadata header (also known as <em>superblock</em>)
at the start. PVs can be partitions on a local hard disk or an SSD,
a software or hardware raid, or a loop device. LVM does not care.
The storage space on a physical volume is managed in units called <em>
Physical Extents </em> (PEs, yellow). The default PE size is 4M. </p>
<p> A <em>Volume Group</em> (VG, green) is a non-empty set of PVs with
a name and a unique ID assigned to it. A PV can but doesn't need to
be assigned to a VG. If it is, the ID of the associated VG is stored
in the metadata header of the PV. </p>
<p> A <em> Logical Volume</em> (LV, blue) is a named block device which is
provided by LVM. LVs are always associated with a VG and are stored
on that VG's PVs. Since LVs are normal block devices, file systems
of any type can be created on them, they can be used as swap storage,
etc. The chunks of an LV are managed as <em>Logical Extents</em> (LEs,
orange). Often the LE size equals the PE size. For each LV there is
a mapping between the LEs of the LV and the PEs of the underlying
PVs. The PEs of a single LV can be spread across multiple PVs. </p>
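<p> The mapping between LEs and PEs can be pictured as a simple
lookup table. The following Python sketch is a toy model under the
assumption that extents are handed out in linear order, one PV after
the other. The class <code>ToyVolumeGroup</code> and the names
<code>pv0</code>, <code>pv1</code> and <code>lv0</code> are made up,
and the allocation policy of the real LVM is more elaborate. </p>

```python
class ToyVolumeGroup:
    """Toy VG which hands out physical extents (PEs) in linear order."""

    def __init__(self, pv_sizes):
        # pv_sizes maps a PV name to its number of PEs.
        self.free = [(pv, pe) for pv, count in pv_sizes.items()
                     for pe in range(count)]
        self.lvs = {}  # LV name -> list of PEs; the list index is the LE number

    def create_lv(self, name, num_extents):
        if num_extents > len(self.free):
            raise ValueError("not enough free PEs in the VG")
        self.lvs[name] = self.free[:num_extents]
        del self.free[:num_extents]

    def lookup(self, name, le):
        """Return the (PV, PE) pair which backs logical extent le."""
        return self.lvs[name][le]


vg = ToyVolumeGroup({"pv0": 3, "pv1": 3})
vg.create_lv("lv0", 4)
for le in range(4):
    print(le, "->", vg.lookup("lv0", le))
```

<p> In this example the first three LEs of the LV are backed by the
first PV while the fourth LE lives on the second PV, so the LV spans
both devices. </p>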
<p> VGs can be extended by adding additional PVs to them, or reduced by
removing unused devices, i.e., those with no PEs allocated on them. PEs
may be moved from one PV to another while the LVs are active. LVs
may be grown or shrunk. To grow an LV, there must be enough space
left in the VG. Growing an LV does not magically grow the file system
stored on it, however. To make use of the additional space, a second,
file system specific step is needed to tell the file system that its
underlying block device (the LV) has grown. </p>
304 <p> The exercises of this section illustrate the basic LVM concepts
305 and the essential LVM commands. They ask the reader to create a VG
306 whose PVs are loop devices. This VG is used as a starting point in
307 subsequent chapters. </p>
311 <ul>
313 <li> Create two 5G large loop devices <code>/dev/loop1</code>
314 and <code>/dev/loop2</code>. Make them PVs by running
315 <code>pvcreate</code>. Create a VG <code>tvg</code> (test volume group)
316 from the two loop devices and two 3G large LVs named <code>tlv1</code>
317 and <code>tlv2</code> on it. Run the <code>pvcreate, vgcreate</code>,
318 and <code>lvcreate</code> commands with <code>-v</code> to activate
319 verbose output and try to understand each output line. </li>
321 <li> Run <code>pvs, vgs, lvs, lvdisplay, pvdisplay</code> and examine
322 the output. </li>
324 <li> Run <code>lvdisplay -m</code> to examine the mapping of logical
325 extents to PVs and physical extents. </li>
327 <li> Run <code>pvs --segments -o+lv_name,seg_start_pe,segtype</code>
328 to see the map between physical extents and logical extents. </li>
330 </ul>
334 In the above scenario (two LVs in a VG consisting of two PVs), how
335 can you tell whether both PVs are actually used? Remove the LVs
336 with <code>lvremove</code>. Recreate them, but this time use the
337 <code>--stripes 2</code> option to <code>lvcreate</code>. Explain
338 what this option does and confirm with a suitable command.
340 »)
342 SECTION(«Device Mapper and Device Mapper Targets»)
344 <p> The kernel part of the Logical Volume Manager (LVM) is called
345 <em>device mapper</em> (DM), which is a generic framework to map
346 one block device to another. Applications talk to the Device Mapper
347 via the <em>libdevmapper</em> library, which issues requests
348 to the <code>/dev/mapper/control</code> character device using the
349 <code>ioctl(2)</code> system call. The device mapper is also accessible
350 from scripts via the <code>dmsetup(8)</code> tool. </p>
<p> A DM target represents one particular mapping type for ranges
of LEs. Several DM targets exist, each of which creates and
maintains block devices with certain characteristics. In this section
we take a look at the <code>dmsetup</code> tool and the relatively
simple <em>mirror</em> target. Subsequent sections cover other targets
in more detail. </p>
361 <ul>
363 <li> Run <code>dmsetup targets</code> to list all targets supported
364 by the currently running kernel. Explain their purpose and typical
365 use cases. </li>
367 <li> Starting with the <code>tvg</code> VG, remove <code>tlv2</code>.
368 Convince yourself by running <code>vgs</code> that <code>tvg</code>
369 is 10G large, with 3G being in use. Run <code>pvmove
370 /dev/loop1</code> to move the used PEs of <code>/dev/loop1</code>
371 to <code>/dev/loop2</code>. After the command completes, run
372 <code>pvs</code> again to see that <code>/dev/loop1</code> has no
373 more PEs in use. </li>
375 <li> Create a third 5G loop device <code>/dev/loop3</code>, make it a
376 PV and extend the VG with <code>vgextend tvg /dev/loop3</code>. Remove
377 <code>tlv1</code>. Now the LEs of <code>tlv2</code> fit on any
378 of the three PVs. Come up with a command which moves them to
379 <code>/dev/loop3</code>. </li>
381 <li> The first two loop devices are both unused. Remove them from
382 the VG with <code>vgreduce -a</code>. Why are they still listed in
383 the <code>pvs</code> output? What can be done about that? </li>
385 </ul>
389 As advertised in the introduction, LVM allows the administrator to
390 replace the underlying storage of a file system online. This is done
391 by running a suitable <code>pvmove(8)</code> command to move all PEs of
392 one PV to different PVs in the same VG.
394 <ul>
396 <li> Explain the mapping type of dm-mirror. </li>
<li> The traditional way to mirror the contents of two or more block
devices is software raid 1, also known as <em>md raid1</em> ("md"
is short for "multiple devices"). Explain the difference between md raid1,
the dm-raid target which supports raid1 and other raid levels, and
the dm-mirror target. </li>
404 <li> Guess how <code>pvmove</code> is implemented on top of
405 dm-mirror. Verify your guess by reading the "NOTES" section of the
406 <code>pvmove(8)</code> man page. </li>
408 </ul>
409 »)
411 SECTION(«LVM Snapshots»)
<p> LVM snapshots are based on the CoW optimization strategy described
earlier in the chapter on <a href="./Unix_Concepts.html#processes">Unix
Concepts</a>. Creating a snapshot means creating a CoW table of the
given size. Just before an LE of a snapshotted LV is written
to, its contents are copied to a free slot in the CoW table. This
preserves an old version of the LV, the snapshot, which can later be
reconstructed by overlaying the CoW table atop the LV. </p>
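<p> The copy-before-write scheme can be modelled in a few lines of
Python. The sketch below is a toy model with made-up names. It assumes
that the snapshot is taken when the object is created and it does not
model the limited size of the CoW table. </p>

```python
class ToySnapshotOrigin:
    """Toy model of an LV with one snapshot taken at creation time."""

    def __init__(self, extents):
        self.extents = list(extents)  # current contents of the LV
        self.cow = {}                 # LE number -> contents at snapshot time

    def write(self, le, data):
        if le not in self.cow:
            # First write to this LE since the snapshot: save the old data.
            self.cow[le] = self.extents[le]
        self.extents[le] = data

    def read_snapshot(self, le):
        # Overlay the CoW table atop the LV.
        if le in self.cow:
            return self.cow[le]
        return self.extents[le]


lv = ToySnapshotOrigin(["a", "b", "c"])
lv.write(1, "B")
lv.write(1, "BB")   # second write to the same LE: nothing more to save
print(lv.extents)
print([lv.read_snapshot(le) for le in range(3)])
```

<p> Only the first write to an LE consumes a slot in the CoW table.
Subsequent writes to the same LE modify the LV only. </p>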
<p> Snapshots can be taken from an LV which contains a mounted file system,
while applications are actively modifying files. Without coordination
between the file system and LVM, the file system most likely has memory
buffers scheduled for writeback. These outstanding writes do not make
it to the snapshot, so one cannot expect the snapshot to contain a
consistent file system image. Instead, it is in a state similar to that of a
regular device after an unclean shutdown. This is not a problem for
XFS and EXT4, as both are <em>journalling</em> file systems, which
were designed with crash recovery in mind. At the next mount after a
crash, journalling file systems replay their journal, which results
in a consistent state. Note that this implies that even a read-only
mount of the snapshot device has to write to the device. </p>
436 <ul>
<li> In the test VG, create a 1G large snapshot named
<code>snap_tlv1</code> of the <code>tlv1</code> LV by using the
<code>-s</code> option to <code>lvcreate(8)</code>. Predict how much
free space is left in the VG. Confirm with <code>vgs tvg</code>. </li>
<li> Create an EXT4 file system on <code>tlv1</code> by running
<code>mkfs.ext4 /dev/tvg/tlv1</code>. Guess how much of the snapshot
space has been allocated by this operation. Check with <code>lvs
tvg/snap_tlv1</code>. </li>
<li> Remove the snapshot with <code>lvremove</code> and recreate
it. Repeat the previous step, but this time run <code>mkfs.xfs</code>
to create an XFS file system. Run <code>lvs tvg/snap_tlv1</code>
again and compare the used snapshot space to the EXT4 case. Explain
the difference. </li>
454 <li> Remove the snapshot and recreate it so that both <code>tlv1</code>
455 and <code>snap_tlv1</code> contain a valid XFS file system. Mount
456 the file systems on <code>/mnt/1</code> and <code>/mnt/2</code>. </li>
458 <li> Run <code>dd if=/dev/zero of=/mnt/1/zero count=$((2 * 100 *
459 1024))</code> to create a 100M large file on <code>tlv1</code>. Check
460 that <code>/mnt/2</code> is still empty. Estimate how much of the
461 snapshot space is used and check again. </li>
463 <li> Repeat the above <code>dd</code> command 5 times and run
464 <code>lvs</code> again. Explain why the used snapshot space did not
465 increase. </li>
467 <li> It is possible to create snapshots of snapshots. This is
468 implemented by chaining together CoW tables. Describe the impact on
469 performance. </li>
<li> Suppose a snapshot was created before significant modifications
were made to the contents of the LV, for example an upgrade of a large
software package. Assume that the user wishes to permanently return to
the old version because the upgrade did not work out. In this scenario
it is the snapshot which needs to be retained, rather than the original
LV. In view of this scenario, guess what happens on the attempt to
remove an LV which is being snapshotted. Unmount <code>/mnt/1</code>
and confirm by running <code>lvremove tvg/tlv1</code>. </li>
480 <li> Come up with a suitable <code>lvconvert</code> command which
481 replaces the role of the LV and its snapshot. Explain why this solves
482 the "bad upgrade" problem outlined above. </li>
484 <li> Explain what happens if the CoW table fills up. Confirm by
485 writing a file larger than the snapshot size. </li>
487 </ul>
489 SECTION(«Thin Provisioning»)
<p> The term "thin provisioning" is just a modern buzzword for
over-subscription. Both terms mean to give the appearance of having
more resources than are actually available. This is achieved by
on-demand allocation. Linux implements thin provisioning
as a DM target called <em>dm-thin</em>. This code
first made its appearance in 2011 and was declared stable two
years later. These days it should be safe for production use. </p>
<p> The general problem with thin provisioning is of course that bad
things happen when the resources are exhausted because the demand has
increased before new resources were added. For dm-thin this can happen
when users write to their allotted space, causing dm-thin to attempt
to allocate a data block from a volume which is already full. This
usually leads to severe data corruption because file systems are
not really prepared to handle this error case and treat it as if the
underlying block device had failed. dm-thin does nothing to prevent
this, but one can configure a <em>low watermark</em>. When the
number of free data blocks drops below the watermark, a so-called
<em>dm-event</em> will be generated to notify the administrator. </p>
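<p> On-demand allocation and the low watermark can be illustrated with
the following Python sketch. It is a toy model, not the dm-thin
implementation. All names are hypothetical, and the assumption that the
event is generated once, at the moment the number of free blocks
crosses the watermark, is a simplification. </p>

```python
class ToyThinPool:
    """Toy thin pool: data blocks are allocated on first write only."""

    def __init__(self, data_blocks, low_watermark):
        self.free_blocks = data_blocks
        self.low_watermark = low_watermark
        self.provisioned = set()  # (volume, block) pairs backed by a data block
        self.events = []          # stand-in for dm-events

    def write(self, volume, block):
        if (volume, block) in self.provisioned:
            return  # already backed by a data block, nothing to allocate
        if self.free_blocks == 0:
            raise OSError("thin pool is out of data blocks")
        was_above = self.free_blocks >= self.low_watermark
        self.free_blocks -= 1
        self.provisioned.add((volume, block))
        if was_above and self.free_blocks < self.low_watermark:
            self.events.append("free data blocks dropped below the watermark")


# The thin volume may claim to be much larger than the pool behind it.
pool = ToyThinPool(data_blocks=4, low_watermark=2)
for block in range(4):
    pool.write("oslv", block)
print(pool.events)
try:
    pool.write("oslv", 4)
except OSError as err:
    print("write failed:", err)
```

<p> The failed write at the end corresponds to the error case which
file systems are not prepared to handle. </p>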
511 <p> One highlight of dm-thin is its efficient support for an arbitrary
512 depth of recursive snapshots, called <em>dm-thin snapshots</em>
513 in this document. With the traditional snapshot implementation,
514 recursive snapshots quickly become a performance issue as the depth
515 increases. With dm-thin one can have an arbitrary subset of all
516 snapshots active at any point in time, and there is no ordering
517 requirement on activating or removing them. </p>
<p> The block devices created by dm-thin always belong to a <em>thin
pool</em> which ties together two LVs called the <em>metadata LV</em>
and the <em>data LV</em>. The combined LV is called the <em>thin pool
LV</em>. Setting up a VG for thin provisioning is done in two steps:
First the standard LVs for the data and the metadata are created. Second,
the two LVs are combined into a thin pool LV. The second step hides
the two underlying LVs so that only the combined thin pool LV is
visible afterwards. Thin provisioned LVs and dm-thin snapshots can
then be created from the thin pool LV with a single command. </p>
<p> Another nice feature of dm-thin is its support for <em>external
snapshots</em>.
An external snapshot is one where the origin for a thinly provisioned
device is not a device of the pool. Arbitrary read-only block
devices can be turned into writable devices by creating an external
snapshot. Reads from an unprovisioned area of the snapshot will be passed
through to the origin. Writes trigger the allocation of new blocks as
usual with CoW. One use case for this is VM hosts which run their VMs
on thinly-provisioned volumes but have the base image on some "master"
device which is read-only and can hence be shared between all VMs. </p>
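<p> The read and write paths of an external snapshot can be sketched
as follows. Again this is a toy model in Python with made-up names.
The read-only origin is represented by a tuple which is shared by two
hypothetical VMs. </p>

```python
class ToyExternalSnapshot:
    """Toy writable device on top of a shared read-only origin."""

    def __init__(self, origin):
        self.origin = origin  # read-only, never modified
        self.own = {}         # blocks provisioned for this snapshot

    def read(self, block):
        if block in self.own:
            return self.own[block]
        # Unprovisioned area: pass the read through to the origin.
        return self.origin[block]

    def write(self, block, data):
        # Writes allocate a block which belongs to this snapshot only.
        self.own[block] = data


base_image = ("boot", "root", "home")  # the read-only "master" device
vm1 = ToyExternalSnapshot(base_image)
vm2 = ToyExternalSnapshot(base_image)
vm1.write(2, "home of vm1")
print(vm1.read(2), "|", vm2.read(2), "|", base_image[2])
```

<p> Each VM sees its own modifications while the base image stays
untouched. </p>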
541 <p> Starting with the <code>tvg</code> VG, create and test a thin pool LV
542 by performing the following steps. The "Thin Usage" section of
543 <code>lvmthin(7)</code> will be helpful.
545 <ul>
547 <li> Remove the <code>tlv1</code> and <code>tlv2</code> LVs. </li>
<li> Create a 5G data LV named <code>tdlv</code> (thin data LV)
and a 500M LV named <code>tmdlv</code> (thin metadata LV). </li>
552 <li> Combine the two LVs into a thin pool with
553 <code>lvconvert</code>. Run <code>lvs -a</code> and explain the flags
554 listed below <code>Attr</code>. </li>
556 <li> Create a 10G thin LV named <code>oslv</code> (over-subscribed
557 LV). </li>
559 <li> Create an XFS file system on <code>oslv</code> and mount it on
560 <code>/mnt</code>. </li>
<li> Run a loop of the form <code>for ((i = 0; i &lt; 50; i++)); do
... ; done</code> so that each iteration creates a 50M file named
<code>file-$i</code> and a snapshot named <code>snap_oslv-$i</code>
of <code>oslv</code>. </li>
567 <li> Activate an arbitrary snapshot with <code>lvchange -K</code> and
568 try to mount it. Explain what the error message means. Then read the
569 "XFS on snapshots" section of <code>lvmthin(7)</code>. </li>
571 <li> Check the available space of the data LV with <code>lvs
572 -a</code>. Mount one snapshot (specifying <code>-o nouuid</code>)
573 and run <code>lvs -a</code> again. Why did the free space decrease
574 although no new files were written? </li>
576 <li> Mount four different snapshots and check that they contain the
577 expected files. </li>
<li> Remove all snapshots. Guess what <code>lvs -a</code> and <code>df
-h /mnt</code> report. Then run the commands to confirm. Guess
what happens if you try to create another 3G file. Confirm
your guess, then read the section on "Data space exhaustion" of
<code>lvmthin(7)</code>. </li>
585 </ul>
589 When a thin pool provisions a new data block for a thin LV, the new
590 block is first overwritten with zeros by default. Discuss why this
591 is done, its impact on performance and security, and conclude whether
592 or not it is a good idea to turn off the zeroing.
594 »)
596 SECTION(«Bcache, dm-cache and dm-writecache»)
<p> All three implementations named in the title of this section are <em>
Linux block layer caches</em>. They combine two different block
devices to form a hybrid block device which dynamically caches
and migrates data between the two devices with the aim of improving
performance. One device, the <em> backing device</em>, is expected
to be large and slow while the other one, the <em>cache device</em>,
is expected to be small and fast. </p>
<div>
define(«bch_width», «300»)
define(«bch_height», «130»)
define(«bch_margin», «10»)
define(«bch_rraid_width», «eval((bch_width() - 4 * bch_margin()) * 4 / 5)»)
define(«bch_raidbox_height», «eval(bch_height() - 2 * bch_margin())»)
define(«bch_nraid_width», «eval(bch_rraid_width() / 4)»)
define(«bch_rdisk_width», «eval((bch_width() - 3 * bch_margin()) * 18 / 100)»)
define(«bch_rdisk_height», «eval((bch_height() - 4 * bch_margin()) / 3)»)
define(«bch_ndisk_width», «eval(bch_rdisk_width() / 2)»)
define(«bch_ndisk_height», «eval(bch_raidbox_height() - 5 * bch_margin())»)
define(«bch_rdisk», «svg_disk(«$1», «$2»,
«bch_rdisk_width()», «bch_rdisk_height()», «#666»)»)
define(«bch_ndisk», «svg_disk(«$1», «$2»,
«bch_ndisk_width()», «bch_ndisk_height()», «#66f»)»)
define(«bch_5rdisk», «
bch_rdisk(«$1», «$2»)
bch_rdisk(«eval($1 + bch_margin())»,
«eval($2 + bch_margin())»)
bch_rdisk(«eval($1 + 2 * bch_margin())»,
«eval($2 + 2 * bch_margin())»)
bch_rdisk(«eval($1 + 3 * bch_margin())»,
«eval($2 + 3 * bch_margin())»)
bch_rdisk(«eval($1 + 4 * bch_margin())»,
«eval($2 + 4 * bch_margin())»)
»)
define(«bch_rraid», «
<rect
fill="#3b3"
stroke="black"
x="$1"
y="$2"
width="bch_rraid_width()"
height="bch_raidbox_height()"
rx="10"
/>
bch_5rdisk(«eval($1 + bch_margin())»,
«eval($2 + 2 * bch_margin())»)
bch_5rdisk(«eval($1 + 2 * bch_rdisk_width() + bch_margin())»,
«eval($2 + 2 * bch_margin())»)
»)
define(«bch_nraid», «
<rect
fill="orange"
stroke="black"
x="$1"
y="$2"
width="bch_nraid_width()"
height="bch_raidbox_height()"
rx="10"
/>
bch_ndisk(eval($1 + bch_margin()),
eval($2 + 2 * bch_margin()))
bch_ndisk(eval($1 + 2 * bch_margin()),
eval($2 + 3 * bch_margin()))
»)
<svg
width="bch_width()" height="bch_height()"
xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
>
<rect
fill="#cc2"
stroke="black"
stroke-width="1"
x="1"
y="1"
width="eval(bch_rraid_width() + bch_nraid_width()
+ 3 * bch_margin() - 2)"
height="eval(bch_raidbox_height() + 2 * bch_margin() - 2)"
rx="10"
/>
bch_nraid(«bch_margin()», «bch_margin()»)
bch_rraid(«eval(2 * bch_margin() + bch_nraid_width())», «bch_margin()»)
</svg>
</div>
<p> The simplest setup consists of a single rotating disk and one SSD.
The setup shown in the diagram at the left is realistic for a large
server with redundant storage. In this setup the hybrid device
(yellow) combines a raid6 array (green) consisting of many rotating
disks (grey) with a two-disk raid1 array (orange) stored on fast
NVMe devices (blue). In the simple setup it is always a win when
I/O is performed from/to the SSD instead of the rotating disk. In
the server setup, however, it depends on the workload which device
is faster. Given enough rotating disks and a streaming I/O workload,
the raid6 outperforms the raid1 because all disks can read or write
at full speed. </p>
697 <p> Since block layer caches hook into the Linux block API described <a
698 href="«#»the_linux_block_layer">earlier</a>, the hybrid block devices
699 they provide can be used like any other block device. In particular,
700 the hybrid devices are <em> file system agnostic</em>, meaning that
701 any file system can be created on them. In what follows we briefly
702 describe the differences between the three block layer caches and
703 conclude with the pros and cons of each. </p>
705 <p> Bcache is a stand-alone stacking device driver which was
706 included in the Linux kernel in 2013. According to the <a
707 href="https://bcache.evilpiepirate.org/">bcache home page</a>, it
708 is "done and stable". dm-cache and dm-writecache are device mapper
709 targets included in 2013 and 2018, respectively, which are both marked
710 as experimental. In contrast to dm-cache, dm-writecache only caches
711 writes while reads are supposed to be cached in RAM. It has been
712 designed for programs like databases which need low commit latency.
713 Both bcache and dm-cache can operate in writeback or writethrough
714 mode while dm-writecache always operates in writeback mode. </p>
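<p> The two cache modes differ in when a write reaches the backing
device. The Python sketch below is a toy model with hypothetical
names. It only tracks where the data is stored and does not model
timing, reads or cache eviction. </p>

```python
class ToyCache:
    """Toy hybrid device consisting of a cache and a backing device."""

    def __init__(self, mode):
        if mode not in ("writeback", "writethrough"):
            raise ValueError(mode)
        self.mode = mode
        self.cache = {}     # small and fast cache device
        self.backing = {}   # large and slow backing device
        self.dirty = set()  # cached blocks not yet on the backing device

    def write(self, block, data):
        self.cache[block] = data
        if self.mode == "writethrough":
            # The write completes only after the slow device has the data.
            self.backing[block] = data
        else:
            # The write completes now, the slow device is updated later.
            self.dirty.add(block)

    def writeback(self):
        # Copy all dirty blocks to the backing device.
        for block in self.dirty:
            self.backing[block] = self.cache[block]
        self.dirty.clear()


wt = ToyCache("writethrough")
wb = ToyCache("writeback")
wt.write(0, "data")
wb.write(0, "data")
print(wt.backing, wb.backing, wb.dirty)
```

<p> In writeback mode the backing device is still empty after the
write. The dirty block exists only on the cache device until
<code>writeback()</code> has run. </p>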
716 <p> The DM-based caches are designed to leave the decision as to what
717 data to migrate (and when) to user space while bcache has this policy
718 built-in. However, at this point only the <em> Stochastic Multiqueue
719 </em> (smq) policy for dm-cache exists, plus a second policy which
720 is only useful for decommissioning the cache device. There are no
721 tunables for dm-cache while all the bells and whistles of bcache can
722 be configured through sysfs files. Another difference is that bcache
723 detects sequential I/O and separates it from random I/O so that large
724 streaming reads and writes bypass the cache and don't push cached
725 randomly accessed data out of the cache. </p>
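<p> The idea behind the detection of sequential I/O can be sketched as
follows. This Python model is a strong simplification and not the
algorithm bcache uses: it tracks a single stream only, and the class
name as well as the parameter <code>cutoff_sectors</code> are made up
for illustration. </p>

```python
class ToySequentialDetector:
    """Decide whether a request should bypass the cache."""

    def __init__(self, cutoff_sectors):
        self.cutoff = cutoff_sectors
        self.next_sector = None  # where a sequential stream would continue
        self.run_length = 0      # sectors in the current sequential run

    def should_bypass(self, start, count):
        if start == self.next_sector:
            self.run_length += count  # the request continues the stream
        else:
            self.run_length = count   # a new stream starts here
        self.next_sector = start + count
        # Long sequential runs go to the backing device directly.
        return self.run_length >= self.cutoff


detector = ToySequentialDetector(cutoff_sectors=16)
for start in (0, 8, 16, 100, 108):
    print(start, detector.should_bypass(start, 8))
```

<p> Requests which continue a long enough run bypass the cache, while
a request at an unrelated position starts over and is cached again. </p>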
<p> bcache is the clear winner of this comparison because it is stable,
configurable and performs better, at least on the server setup
described above, because it separates random and sequential I/O. The
only advantage of dm-cache is its flexibility because cache policies
can be switched. But even this remains a theoretical advantage as
long as only a single policy for dm-cache exists. </p>
736 <ul>
738 <li> Recall the concepts of writeback and writethrough and explain
739 why writeback is faster and writethrough is safer. </li>
741 <li> Explain how the <em>writearound</em> mode of bcache works and
742 when it should be used. </li>
<li> Set up a bcache device from two loop devices. </li>
<li> Create a file system on a bcache device and mount it. Detach
the cache device while the file system is mounted. </li>
<li> Set up a dm-cache device from two loop devices. </li>
<li> Set up a thin pool where the data LV is a dm-cache device.</li>
753 <li> Explain the point of dm-cache's <em>passthrough</em> mode.</li>
755 </ul>
Explain why small writes to a file system which is stored on a
parity raid result in read-modify-write (RMW) updates. Explain why
RMW updates are particularly expensive and how raid implementations
and block layer caches try to avoid them.
»)
Recall the concepts of writeback and writethrough. Describe what
each mode means for a hardware device and for a bcache/dm-cache
device. Explain why writeback is faster and writethrough is safer.
»)
TRIM and UNMAP are special commands in the ATA/SCSI command sets
which inform an SSD that certain data blocks are no longer in use,
allowing the SSD to re-use these blocks to increase performance and
to reduce wear. Subsequent reads from the trimmed data blocks will
not return any meaningful data. For example, the <code> mkfs </code>
command sends this command to discard all blocks of the device.
Discuss the implications when <code> mkfs </code> is run on a device
provided by bcache or dm-cache.
»)
SECTION(«The dm-crypt Target»)
<p> This device mapper target provides encryption of arbitrary block
devices by employing the primitives of the crypto API of the Linux
kernel. This API provides a uniform interface to a large number of
cipher algorithms which have been implemented with performance and
security in mind. </p>
<p> The cipher algorithm of choice for the encryption of block devices
is the <em> Advanced Encryption Standard </em> (AES), also known
as <em> Rijndael</em>, named after the two Belgian cryptographers
Rijmen and Daemen who proposed the algorithm in 1999. AES is a <em>
symmetric block cipher</em>, that is, a transformation which operates
on fixed-length blocks and which is determined by a single key for both
encryption and decryption. The underlying algorithm is fairly simple,
which makes AES perform well in both hardware and software. Key setup
is fast and the memory requirements are low. Modern processors of all
major manufacturers include instructions to perform AES operations in
hardware, improving speed and security. </p>
<p> According to the Snowden documents, the NSA has been doing research
on breaking AES for a long time without being able to come up with
a practical attack for 256 bit keys. Successful attacks invariably
target the key management software instead, which is often implemented
poorly, trading security for user-friendliness, for example by
storing passwords weakly encrypted, or by providing a "feature"
which can decrypt the device without knowing the password. </p>
<p> The exercises of this section ask the reader to encrypt a loop device
with AES without relying on any third party key management software. </p>
<ul>
<li> Discuss the message of this <a
href="https://xkcd.com/538/">xkcd</a> comic. </li>
<li> How can a hardware implementation of an algorithm like AES
improve security? After all, it is the same algorithm that is
implemented. </li>
<li> What's the point of the <a href="#random_stream">rstream.c</a>
program below which writes random data to stdout? Doesn't <code>
cat /dev/urandom </code> do the same? </li>
<li> Compile and run <a href="#random_stream">rstream.c</a> to create
a 10G local file and create the loop device <code> /dev/loop0 </code>
from the file. </li>
<li> A <em> table </em> for the <code> dmsetup(8) </code> command is
a single line of the form <code> start_sector num_sectors target_type
target_args</code>. Determine the correct values for the first three
arguments to encrypt <code> /dev/loop0</code>. </li>
<li> The <code> target_args </code> for the dm-crypt target are
of the form <code> cipher key iv_offset device offset</code>. To
encrypt <code> /dev/loop0 </code> with AES-256, <code> cipher </code>
is <code> aes</code>, <code> device </code> is <code> /dev/loop0
</code> and both offsets are zero. Come up with an idea to create a
256 bit key from a passphrase. </li>
<li> The <code> create </code> subcommand of <code> dmsetup(8)
</code> creates a device from the given table. Run a command of
the form <code> echo "$table" | dmsetup create cryptdev </code>
to create the encrypted device <code> /dev/mapper/cryptdev </code>
from the loop device. </li>
<li> Create a file system on <code> /dev/mapper/cryptdev</code>,
mount it and create the file <code> passphrase </code> containing
the string "super-secret" on this file system. </li>
<li> Unmount the <code> cryptdev </code> device and run <code> dmsetup
remove cryptdev</code>. Run <code> strings </code> on the loop device
and on the underlying file to see whether they contain the string <code>
super-secret </code> or <code> passphrase</code>. </li>
<li> Re-create the <code> cryptdev </code> device, but this time use
a different (hence invalid) key. Guess what happens and confirm. </li>
<li> Write a script which disables echoing (<code>stty -echo</code>),
reads a passphrase from stdin and combines the above steps to create
and mount an encrypted device. </li>
</ul>
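<p> The table format described in the exercises above can be sketched
as a small shell function. This is only one possible way to assemble
the line: the function name <code>crypt_table</code> and the all-zero
key are placeholders invented for this example, and deriving a real
key from a passphrase is deliberately left open. </p>

```shell
#!/bin/sh
# Assemble a dm-crypt table line for dmsetup(8).
# $1: device size in bytes, $2: device path, $3: key as hex digits.
# crypt_table is a hypothetical helper, not part of any existing tool.
crypt_table()
{
	size=$1 dev=$2 key=$3

	# A 256 bit key corresponds to 64 hexadecimal digits.
	if [ ${#key} -ne 64 ]; then
		echo "key must consist of 64 hex digits" >&2
		return 1
	fi
	case $key in
	*[!0-9a-fA-F]*)
		echo "key must be hexadecimal" >&2
		return 1;;
	esac
	# Device mapper tables count in sectors of 512 bytes.
	# Fields: start num_sectors target cipher key iv_offset device offset
	printf '0 %s crypt aes %s 0 %s 0\n' "$((size / 512))" "$key" "$dev"
}

# Example: a 10G device and a dummy all-zero key (never use such a key).
crypt_table $((10 * 1024 * 1024 * 1024)) /dev/loop0 "$(printf '%064d' 0)"
```

<p> For a 10G device the function reports 20971520 sectors. The
resulting line could then be piped to <code>dmsetup create</code>
as described in the exercises. </p>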
Why is it a good idea to overwrite a block device with random data
before it is encrypted?
»)
The dm-crypt target encrypts whole block devices. An alternative is
to encrypt on the file system level. That is, each file is encrypted
separately. Discuss the pros and cons of both approaches.
»)
SUBSECTION(«Random stream»)

<pre>
<code>
/* Link with -lcrypto */
#include &lt;openssl/rand.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt; /* exit(), EXIT_FAILURE */
#include &lt;unistd.h&gt;

int main(int argc, char **argv)
{
	unsigned char buf[1024 * 1024];

	/* Write random data to stdout until an error occurs. */
	for (;;) {
		/* RAND_bytes() returns 1 on success. */
		int ret = RAND_bytes(buf, sizeof(buf));

		if (ret &lt;= 0) {
			fprintf(stderr, "RAND_bytes() error\n");
			exit(EXIT_FAILURE);
		}
		/* Short writes are harmless here: all bytes are random. */
		ret = write(STDOUT_FILENO, buf, sizeof(buf));
		if (ret &lt; 0) {
			perror("write");
			exit(EXIT_FAILURE);
		}
	}
	return 0;
}
</code>
</pre>