Initial commit.
[aple.git] / Filesystems.m4
1 TITLE(«
2
3 Happy filesystems are all alike, but every corrupted filesystem is
4 unhappy in its own way. -- Jonathan Corbet (2011)
5
6 », __file__)
7
8 OVERVIEW(«
9
10 The first part of this chapter covers general concepts related to
11 filesystems. This part is largely independent of the underlying
12 operating system. The later sections look at some aspects of two
13 popular local local filesystems for Linux: ext4 and xfs. The last
14 section contains selected topics related to the network filesystem,
15 nfs.
16 »)
17
18 SECTION(«Introduction»)
19
20 <p> Every Unix system has at least one filesystem which contains
21 the root of the tree of files. This filesystem, the <em>root file
22 system</em>, is normally stored on a local block device, for example on
23 an SSD. On top of that, several additional filesystems of different
24 types are mounted usually. For example, most Linux distributions
25 mount the proc and sys pseudo filesystems and several instances of
26 tmpfs. Other filesystems may be mounted as needed. </p>
27
28 SUBSECTION(«Classification»)
29
30 <p> The Linux kernel supports several dozens of different filesystems
31 and new ones are added frequently. Some filesystems are only employed
32 for special purpose computers while others are in use on almost all
33 systems. The <code>/proc/filesystems</code> pseudo file contains the
34 list of supported filesystems. We don't aim to provide a full listing
35 but classify filesystems as belonging to exactly one of the following
36 categories. </p>
37
38 <dl>
39 <dt> local </dt>
40
41 <dd> The filesystem is stored on a local block device, and only the
42 local computer can access it. Examples: ext4, xfs, fat. </dd>
43
44 <dt> pseudo </dt>
45
46 <dd> These filesystems are characterized by the absence of
47 backing storage. That is, there is no block device which stores the
48 contents. Instead, contents exist only in memory, or are provided on
49 demand (i.e., when the files are accessed). When the filesystem is
50 unmounted, all data is lost. Examples: tmpfs, sysfs, proc. </dd>
51
52 <dt> network </dt>
53
54 <dd> This type of filesystem makes files which are physically stored
55 on a different computer visible on the local computer. Examples: nfs,
56 cifs. </dd>
57
58 <dt> fuse (filesystem in user space) </dt>
59
60 <dd> Contents are provided by a user space application. Examples:
61 sshfs. </dd>
62
63 <dt> distributed </dt>
64
65 <dd> The contents are stored on more than one computer. Examples:
66 glusterfs, lustre, nfs-4.1. </dd>
67
68 </dl>
69
70 SUBSECTION(«POSIX Filesystems»)
71
72 <p> Regardless of the category, most native Unix filesystems support
73 the semantics prescribed by the POSIX.1-2008 standard. These include
74 <code>open(2), read(2), write(2), stat(2)</code> system calls and
75 many more. In particular, a POSIX filesystem must store in each
76 file's metadata the user and group ID of the owner and the usual
77 three timestamps. For compatibility reasons this is not possible for
78 "foreign" filesystems like Microsoft's FAT which was designed for
79 the single-user DOS operating system and thus has no concept of file
80 ownership. </p>
81
82 SUBSECTION(«User, Group, Directory and Project Quotas»)
83
84 <p> Early Unix systems already supported user and group quotas while
85 directory and project quotas are comparatively new concepts. User and
86 group quotas impose an upper bound on the files owned by a specific
87 user or Unix group, respectively. Directory quotas restrict the size
88 of an accounted directory. Project quotas are similar to directory
89 quotas but are not restricted to single directories. They are realized
90 as an aggregation of unrelated inodes with a specific identifier,
91 the <em>project ID</em>, which is stored in each accounted inode.
92 It is possible for arbitrary files to have the same project ID and
93 hence be accounted to the same project. The project ID is independent
94 from the UID and the GID, hence project accounting is independent of
95 user and group accounting. </p>
96
97 <p> For each quota type there are two configurable limits: the
98 <em>inode limit</em> which imposes a bound on the number of files
99 and directories owned by a specific user/group/project ID, and the
100 <em>block limit</em> which bounds the space that the same set of files
101 is permitted to occupy. Each limit is in fact a pair of limits: the
102 <em>soft limit</em> and the <em>hard limit</em>. When utilization
103 reaches the soft limit, a message is sent but no further action is
104 taken. Only if the hard limit is reached, subsequent attempts to
105 request more resources fail with the <code>EDQUOT</code> (<code>Disk
106 quota exceeded</code>) error. </p>
107
108 SECTION(«Filesystem Design»)
109
110 In this section we take a closer look at the data structures of local
111 filesystems, their on-disk layout, and some of the techniques that
112 make filesystems fast and robust.
113
114 SUBSECTION(«Superblocks»)
115
116 <p> The <em>superblock</em> describes the <em>geometry</em> of the
117 file system. Only one superblock is needed for normal operation but
118 most local filesystems store backup copies of the superblock. These
119 extra superblocks are only accessed during recovery from filesystem
120 corruption, for example if the main superblock was overwritten by
121 accident. Although the details vary between filesystem types, the
122 following information is typically stored in the superblock: </p>
123
124 <ul>
125 <li> The <em>magic number</em>, or <em>signature</em> which identifies
126 the block device as containing a filesystem of a certain type. For
127 example, the magic number for ext2 is 0xef534. Newer filesystems use
128 larger signatures. </li>
129
130 <li> The size of the filesystem. </li>
131
132 <li> The last mount time.
133
134 <li> A universally unique identifier (UUID) which is created randomly
135 at filesystem creation time. </li>
136
137 <li> Utilization counts like the number of free blocks. </li>
138 </ul>
139
140 <p> The blkid library, libblkid, contains a database of filesystems and
141 of other software which stores its metadata in a specific superblock
142 of a block device. This includes filesystems, swap devices, physical
143 volumes for software raid or the logical volume manager. The library
144 enables applications to identify the contents of on a block device
145 and to extract additional information like the UUID. There are a
146 number of tools which are linked against this library to examine
147 or modify the superblocks: <code>lsblk(8)</code> to list block
148 devices, <code>blkid(8)</code> to print block device properties, and
149 <code>wipefs(8)</code> to overwrite (or merely print) all superblocks
150 of a given block device. Also the <code>mount(8)</code> executable is
151 usually linked against libblkid to support mounting by UUID instead
152 of device name because device names like <code>/dev/sda</code> might
153 change across reboots. </p>
154
155 SUBSECTION(«B-trees and Variants»)
156
157 <p> Most filesystems, including the ext4 and xfs filesystems described
158 in dedicated sections, employ some B-tree variant to manage their
159 data blocks. This is reason enough to take a closer look at the key
160 features of this ubiquitous data structure. We won't go into detail,
161 though. </p>
162
163 <p> B-trees were invented in the 1970s by Rudolf Bayer and Ed McCreight
164 as a data structure for an algorithm that can quickly access a random
165 block in a particular file stored on on a rotating disk. The parameters
166 can be tuned to minimize the number of disk operations performed,
167 i.e., the height of the tree. </p>
168
169 <p> It is unclear what the "B" in "B-tree" actually means. It certainly
170 does not mean "binary", because B-trees typically have a high fanout,
171 so nodes have way more than two children (which, by definition,
172 is the maximal number of child nodes for a binary tree). The typical
173 fanout values for filesystems range from 100 to 1000. </p>
174
175 <p> B-trees impose a fixed lower bound on the number of child nodes, and
176 an upper bound that is twice as large as the lower bound. The upper
177 bound is called the <em>order</em> of the tree. These bounds imply
178 an upper bound for the maximal height of the tree, given the number
179 of leaf nodes. Hence a B-tree is always well balanced, lookup times
180 are always optimal (i.e., logarithmic), and storage utilization is
181 at least 50%. Unlike a hash table approach there is no decrease in
182 performance when utilization approaches 100%. </p>
183
184 <p> Addition and removal of nodes is performed in a way that keeps
185 the tree balanced. For example, if a node is being removed and this
186 results in a violation of the lower bound on the number of child nodes
187 of the parent node, nodes are moved between siblings, or siblings
188 are merged. </p>
189
190 <p> A node with <em>k</em> children always has <em>k - 1</em> keys.
191 For filesystems, the keys can be block numbers, hashes, directory
192 entries, or the size of block ranges. In any case, the keys act as
193 separators which divide the child nodes. For example, if the node
194 contains 3 child nodes and the two keys 23 and 42, then the left child
195 node contains keys less than 23, the middle child node contains keys
196 between 23 and 42, and the right child node contains keys greater
197 than 42. </p>
198
199 <p> Many filesystems, including xfs, do not use the classic B-tree
200 structure outlined above but its variant called <em>B+tree</em>. The
201 main differences between a B-tree and a B+tree are (a) data records
202 are stored only in leaf nodes, and (b) leaves are linked together so
203 that all data records can be traversed in-order by following sibling
204 pointers. This little detail has far-reaching consequences. Roughly
205 speaking, the sibling pointers prohibit copy on write for metadata
206 blocks. </p>
207
208 SUBSECTION(«Journaling»)
209
210 <p> Single filesystem operations often need to update multiple
211 blocks. For example, adding a new file to a directory requires three
212 block updates: the data block which contains the new file's contents,
213 the metadata block which contains the directory entries, and the
214 on-disk structures that manage the free blocks. Regardless of the
215 order in which these updates are performed, an unclean shutdown due
216 to a power outage or a system crash leads to a corrupt filesystem if
217 the shutdown happens after the first but before the third update was
218 performed. Journaling is a capability which avoids this situation by
219 making multiple block updates atomic, thereby ensuring consistency
220 after an unclean shutdown. </p>
221
222 <p> One of the first journaling filesystems was jfs, introduced 1990 by
223 IBM for the AIX operating system. The first journaling filesystem for
224 Linux was reiserfs version 3, which was included in the Linux kernel
225 in 2001. In the same year Linux gained support for three additional
226 journaling filesystems: jfs and xfs were ported to Linux from AIX
227 and IRIX, respectively, and the first stable version of ext3 was
228 released. </p>
229
230 <p> Journaling filesystems retain consistency after an unclean shutdown by
231 keeping a <em>journal</em> (also known as <em>log</em>) of operations
232 being performed. The journal is either stored on an separate device or
233 in a reserved area within the filesystem. The entries of the journal,
234 the <em>log records</em>, describe <em>transactions</em>, which are
235 filesystem operations that must be performed atomically. At the next
236 mount after an unclean shutdown, the journal is <em>replayed</em>,
237 that is, the recorded transactions are reapplied. The time required
238 to replay the journal depends only on the size of the journal and
239 the number of log records, but not on the size of the filesystem
240 or the number of files. It usually takes only a couple of seconds,
241 which is considerably faster than a filesystem check/repair run, which
242 can take hours. </p>
243
244 <p> Although a journaling filesystem writes metadata twice, this can
245 actually <em>increase</em> performance because metadata writes to the
246 journal are sequential, and when the log entries are committed, writes
247 can be combined and reordered, which is a win not only for rotating
248 disks. Since data integrity is usually less important than filesystem
249 integrity, only metadata (inodes, directory contents) is journaled
250 by default while data blocks (file contents) are written directly. </p>
251
252 SUBSECTION(«Delayed Allocation»)
253
254 <p> This term refers to the technique of deferring the decision of
255 which blocks to allocate until the last possible moment, when blocks
256 have to be written out due to memory pressure or an explicit sync
257 request from user space. This technique is employed by xfs and ext4
258 but not by earlier versions of the ext* family. </p>
259
260 <p> Delayed allocation improves the performance of the filesystem because
261 the allocator has better knowledge of the eventual file size, so
262 files are more likely to be laid out in an optimal way: data blocks
263 sequentially, and close to the metadata blocks that describe the
264 file. Moreover, small temporary files can be fully buffered in memory
265 and don't cause any block allocation at all if the in-memory data
266 structures have already been removed when writeout triggers. Delayed
267 allocation is more effective on large memory systems because with
268 plenty of memory available, allocations can be deferred for longer. </p>
269
270 EXERCISES()
271
272 <ul>
273 <li> On a Linux system, run <code>cat /proc/filesystems</code> to
274 see all file systems supported on this system. </li>
275
276 <li> Discuss to which extent snapshot-capable filesystems can replace
277 off-site backups. </li>
278
279 <li> Run <code>mount | awk '{print $5}' | sort | uniq</code> to
280 examine the different filesystem types which are mounted. Explain
281 the purpose of each and determine to which of the above classes the
282 filesystem belongs. </li>
283
284 <li> Understand the impact on performance of the atime mount
285 option. Check which file systems are mounted with atime enabled. </li>
286
287 <li> Discuss the contents of <code>/etc/mtab</code>, and
288 <code>/proc/filesystems</code>. Check whether <code>/etc/mtab</code>
289 is a symbolic link. Discuss whether it should be one. </li>
290
291 <li> Does a read-only mount of a journaling filesystem modify the
292 contents of the underlying block device? </li>
293
294 </ul>
295
296 HOMEWORK(«
297
298 Describe the various file locking types mandated by POSIX-2008.1.
299
300 »)
301
302 HOMEWORK(«
303
304 <ul>
305 <li> Summarize what POSIX <code>sync(2), fsync(2)</code> and
306 <code>fdatasync(2)</code> guarantee. </li>
307
308 <li> Consider the following commands: <code>touch foo; echo bar >
309 foo; sync; mv foo bar; echo baz > foo; fsync foo</code>. Assume the
310 system crashed at some point in between. Describe the possible states
311 of the two files after the journal has been replayed. </li>
312 </ul>
313
314 »)
315
316 HOMEWORK(«
317
318 <em>Delayed logging</em> is a feature of ext3 which was later ported
319 to xfs. The term refers to the concept of accumulating changes in
320 memory before writing them to the journal. Discuss the pros and cons
321 of delayed logging.
322
323 »)
324
325 HOMEWORK(«
326
327 Explain what <em>reflink</em> copies are and discuss when and how to
328 use them.
329
330 »)
331
332 SECTION(«Alternatives to Journaling»)
333
334 <p> Although our focus lies in topics related to journaling
335 filesystems, we shall have a brief look at two different (but
336 related) approaches which also guarantee consistency after a crash.
337 Both approaches depart from the traditional way of updating data
338 and metadata structures. To illustrate the difference, consider
339 the situation where a line is appended to an existing text file
340 on a traditional filesystem. The filesystem first updates the data
341 blocks which contains the file contents. Since the file size and the
342 modification time have changed, the inode of the file needs to be
343 updated as well. Finally, the on-disk data structures which keep track
344 of the used and unused blocks might also need to be updated if a new
345 block had to be allocated for the additional line. All three updates
346 are usually performed in-place by overwriting the existing blocks.
347 The filesystems described in this section are different in that they
348 avoid such in-place changes. </p>
349
350 SUBSECTION(«Log-structured Filesystems»)
351
352 <p> A log-structured filesystem only writes sequential log entries which
353 describe the changes that have been made, essentially treating the
354 entire space as the journal. The first log-structured filesystem was
355 developed in the 1990s. Popular log-structured filesystems in use
356 today are logfs, ubifs (the <em>unsorted block image filesystem</em>),
357 and f2fs (the <em>flash-friendly filesystem</em>). </p>
358
359 <p> The layout of the data and metadata blocks of a log-structured
360 filesystem bears advantages and disadvantages. One advantage is that
361 writes are always sequential, which is particularly good for rotating
362 disks. However, reading a large file sequentially becomes slower
363 because of data fragmentation. The purely sequential writes also reduce
364 the decay of the storage media, particularly flash memory, because
365 without in-place writes, all blocks are written about the same number
366 of times. Other advantages of log-structured filesystem are that crash
367 recovery is conceptionally easy and that snapshots are natural. </p>
368
369 <p> The main disadvantage is the overhead incurred by <em>garbage
370 collection</em>: getting rid of old log entries that have been
371 superseded by later ones. This overhead is particularly large if free
372 space is short. Another disadvantage is related to inode management:
373 since inodes are scattered throughout the disk, and the location of an
374 inode changes whenever the file is updated, some kind of <em>inode
375 map</em> is required to locate inodes. Updates to the inode map
376 can be written to the log sequentially, but the current location
377 of the inode map must be stored in a special fixed location called
378 the <em>checkpoint region</em> so that the filesystem can recover
379 from a crash. Since every write to the checkpoint region causes
380 a seek, log-structured filesystems update the inode map only once
381 in a while, for example once per minute. This process is known as
382 <em>checkpointing</em>. </p>
383
384 SUBSECTION(«Copy on Write Filesystems»)
385
386 <p> In the context of filesystems, the term <em>copy on write</em>
387 (CoW) means to not overwrite existing data or metadata blocks as
388 files are modified. Instead, whenever the filesystem needs to modify
389 a data or metadata block, it writes the modified block to different,
390 currently unused location, leaving the contents of the original
391 block intact. Next, the metadata blocks that need to be altered
392 are also written to a free location without overwriting existing
393 metadata blocks. For example, in the scenario outlined above where the
394 <code>write(2)</code> system call extends an existing file, the three
395 updates for data, inode and free space information are performed as
396 writes to unused locations. Next, the filesystem switches to the new
397 metadata to commit the change. CoW filesystems are always consistent
398 if this last step can be performed atomically. </p>
399
400 <p> Two well-known open source CoW filesystem are zfs and btrfs
401 (the <em>B-tree filesystem</em>). The first stable release of zfs for
402 the Solaris operating system appeared in 2005, after four years of
403 development by Sun Microsystems. Since then zfs has been ported to
404 several other operating systems including FreeBSD and Linux. However,
405 the zfs code is licensed under the CDDL license, which is regarded
406 as incompatible with the GPL. For this reason, zfs is not included
407 in the official Linux operating system kernel but is available as
408 a third-party kernel module. Btrfs is another CoW filesystem which
409 was designed for Linux from the start. It was merged into the Linux
410 kernel in 2009. The feature sets of zfs and btrfs are similar, but
411 the implementation details vary. </p>
412
413 <p> CoW filesystems also have disadvantages. On a system where multiple
414 processes append data to different files simultaneously, the data
415 blocks of each file will be fragmented, which is bad for performance.
416 For the same reason, metadata blocks are fragmented as well, so
417 performance suffers if the filesystem contains many files. Another
418 disadvantage is related to the fact that it is difficult to tell
419 how much space is going to be needed for a given CoW operation,
420 which has caused an endless list of bugs that occur when disk space
421 gets tight. This is why CoW filesystems should never use more than a
422 certain ratio of the available space. For zfs the recommended limit
423 is as low as 70%. </p>
424
425 EXERCISES()
426
427 <ul>
428 <li> zfs and btrfs are not only filesystems: they include many other
429 features which are traditionally performed by the volume management
430 subsystem or by the raid driver. This includes device group handling,
431 snapshotting, mirroring to get redundancy, striping to increase
432 performance, and more.
433
434 <p> Search the web for "blatant layering violation" and glance over
435 the (sometimes heated) discussions on whether or not it is a good
436 idea to combine volume management, raid and filesystems. Then form
437 your own opinion about the topic. </p> </li>
438 </ul>
439
440 SECTION(«Encryption»)
441
442 <p> The dm-crypt device mapper target, which was covered in the <a
443 href="LVM.html">chapter on LVM</a>, operates at the block level.
444 It encrypts and decrypts one block at a time, regardless of whether
445 the block is in use. This is in contrast to <em>filesystem-level
446 encryption</em>, where encryption is performed by the filesystem
447 on a per inode basis so that, for example, different files can
448 be encrypted with different keys. </p>
449
450 <p> In this section we look at two filesystem-level encryption
451 primitives for Linux: ecryptfs (the <em>enterprise cryptographic
452 filesystem</em>) and fscrypt (<em>filesystem encryption</em>). The
453 former was included into Linux in 2006 while the latter is much
454 newer. It was originally part of the <em>flash-friendly filesystem</em>
455 (f2fs) but has been made generic in 2015. Besides f2fs, also ext4
456 and ubifs rely on fscrypt for encryption. </p>
457
458 <p> By definition, filesystem-level encryption means to encrypt the
459 contents of regular files. In addition to that, both ecryptfs and
460 fscrypt also encrypt file names. However, information stored in the
461 inode is left unencrypted. Therefore, without the encryption key it is
462 still possible to list directories as usual, but the list will contain
463 only encrypted filenames. Moreover, file size and timestamps can be
464 read as for unencrypted files with the standard <code>stat(2)</code>
465 system call. </p>
466
467 SUBSECTION(«ecryptfs»)
468
469 <p> ecryptfs is a so-called <em>stacked</em> filesystem. That is,
470 it relies on an (arbitrary) mounted filesystem as backend storage. An
471 ecryptfs mount is similar to a bind mount in that it makes the files
472 stored at source location (the mountpoint of the backend storage)
473 visible at the target location. However, the source usually contains
474 only encrypted files while files appear unencrypted at the target
475 location. Each encrypted file is self-contained in the sense that
476 it starts with a header which, together with the encryption key, is
477 sufficient to decrypt the file (and the file name). Hence encrypted
478 files can be copied between hosts, and encrypted files can be backed
479 up without telling the backup software the encryption key. </p>
480
481 SUBSECTION(«fscrypt»)
482
483 <p> fscrypt takes a different approach. It provides encryption
484 through a general library that can, in principle, be used by any
485 filesystem. With fscrypt it is possible to store both encrypted and
486 unencrypted files on the same filesystem. fscrypt has a lower memory
487 footprint than ecryptfs since it avoids caching filesystem contents
488 twice. Also, only half as many directory entries and inodes are
489 needed. Another advantage of fscrypt is that the fscrypt API can be
490 used by unprivileged users, with no need to mount a second filesystem.
491 The major drawback of fscrypt is that <code>open(2)</code> system call
492 fails without the key. Since backup software has to open regular files,
493 it is not possible to backup encrypted files without the encryption
494 key. </p>
495
496 EXERCISES()
497 <ul>
498 <li> Explain why the root directory of the filesystem cannot be
499 encrypted with fscrypt. </li>
500 </ul>
501
502 HOMEWORK(«
503
504 Discuss the pros and cons of filesystem level encryption vs.
505 block level encryption.
506
507 »)
508
509 SECTION(«The Virtual Filesystem Switch (vfs)»)
510
511 <p> The main task of the vfs is to provide an abstraction for
512 applications to access files in a uniform way, even if the files
513 are stored on different filesystems. The vfs is responsible for
514 parsing path names received from user space via system calls, and
515 to forward the requests it can not handle itself to the specific
516 filesystem implementation, thereby associating paths with instances
517 of mounted filesystems. This encourages a modular filesystem design
518 where filesystems are opaque entities which provide a certain set of
519 methods called <em>filesystem operations</em>, for example mounting
520 the filesystem or opening a file. The modular design helps to avoid
521 code duplication, which is important for operating systems like Linux
522 which support many different filesystems. </p>
523
524 <p> The first vfs implementation was probably shipped in the Solaris
525 Unix System in 1985. Linux got its first vfs implementation together
526 with the <em>extended</em> filesystem, the predecessor of ext2,
527 ext3 and ext4. All modern operating systems have some sort of vfs,
528 although implementation details differ. In what follows, we shall
529 only discuss the Linux vfs. </p>
530
531 <p> Filesystems register themselves with the vfs at boot time or
532 when the kernel module for the filesystem is loaded. The vfs keeps
533 track of the available filesystem types and all mounts. To perform
534 efficiently, the vfs maintains several data structures which describe
535 the characteristics of the tree of files. We look at the most important
536 data structures below but leave out the rather complicated details
537 about how the various locking primitives (spinlocks, refcounts, RCU)
538 are employed to deal with concurrency. </p>
539
540 SUBSECTION(«The Dentry Cache»)
541
542 <p> A <em>dentry</em> (short for "directory entry") is a data structure
543 which represents a file or a directory. Dentries contain pointers to
544 the corresponding inode and to the parent dentry. The vfs maintains
545 the <em>dentry cache</em>, which is independent of the normal page
546 cache that keeps copies of file contents in memory. Dentries are kept
547 in hashed lists to make directory lookups fast. </p>
548
549 <p> On a busy system the dentry cache changes frequently. For example,
550 file creation, removal and rename all trigger an update of the dentry
551 cache. Moreover, memory pressure can cause dentries to be evicted from
552 the cache at any time. Clearly, some sort of coordination is needed to
553 keep the dentry cache consistent in view of concurrent changes, like
554 a file being deleted on one CPU and looked up on another. A global
555 lock would scale very poorly, so a more sophisticated method called
556 <em>RCU-walk</em> is employed. With RCU, lookups can be performed
557 without taking locks, and read operations can proceed in parallel
558 with concurrent writers. </p>
559
560 <p> The dentry cache also contains <em>negative</em> entries
561 which represent nonexistent paths which were recently looked up
562 unsuccessfully. When a user space program tries to access such a path
563 again, the <code>ENOENT</code> error can be returned without involving
564 the filesystem. Since lookups of nonexistent files happen frequently,
565 failing such lookups quickly enhances performance. Naturally, negative
566 dentries do not point to any inode. </p>
567
568 <p> Dentries are reference-counted. As long as there is a reference
569 on a dentry, it can not be pruned from the dentry cache. </p>
570
571 SUBSECTION(«File and Inode Objects»)
572
573 <p> Positive entries in the dentry cache point to <em>inode
574 objects</em>, which are in-memory copies of the on-disk inode
575 structures maintained by the filesystem. Different dentry cache entries
576 can map to the same inode object if the underlying filesystem supports
577 hard links, but entries which refer to directories are unique. The
578 <code>stat(2)</code> system call can be served without calling into
579 the filesystem if the path argument of the system call corresponds
580 to an entry of the dentry cache. </p>
581
582 <p> When a file is opened, the vfs allocates a <em>file object</em>
583 (also known as <em>file description</em> and <em>struct file</em>),
584 and adds a reference to the file object to the calling process' table
585 of open files. The index to this table is returned as the <em>file
586 descriptor</em> from the <code>open(2)</code> system call. The file
587 object contains a reference to the dentry and to the filesystem
588 specific methods for the usual operations like <code>read(2)</code>
589 and <code>write(2)</code>. Also the file offset and the file status
590 flags (<code>O_NONBLOCK, O_SYNC, etc.</code>) are recorded in the
591 file object. </p>
592
593 <p> Like dentries, file objects are reference-counted. Once the
594 counter hits zero, the file object is freed. System calls like
595 <code>dup(2)</code> and <code>fork(2)</code> increase the reference
596 count while <code>close(2)</code> and <code>exit(2)</code> decrease
597 it. However, not all file object references correspond to file
598 descriptors, since also the kernel itself can hold references to a
599 file object. For example, the loop device driver and the ecryptfs
600 stacked filesystem increase the reference counter of the file objects
601 they work with. Passing a file descriptor from one process to another
602 via local sockets is another situation where the reference counters
603 of the affected file object need to be adjusted. </p>
604
605 <p> When an application calls <code>dup(2)</code> to duplicate a
606 file descriptor, the two copies refer to the same file object and
607 therefore share the file offset and the status flags of the file
608 object. An <code>open(2)</code> call, however, creates a new file
609 object even if the file is already open. </p>
610
611 SUBSECTION(«vfs Mounts and vfs Superblocks»)
612
613 <p> Another job of the vfs is to keep track of the tree of all mounts
614 and their properties. Do do so, the vfs maintains a tree of <em>mount
615 structures</em>. Whenever a filesystem is mounted, one such structure
616 is allocated and linked into the tree. Among other information,
617 the mount structure contains various pointers, including pointers to </p>
618
619 <ul>
620 <li> the mount structure of the parent mount, </li>
621
622 <li> the dentry that corresponds to the mountpoint (the root of the
623 mounted filesystem), </li>
624
625 <li> the superblock of the mounted filesystem, </li>
626
627 <li> the containing mount namespace. </li>
628 </ul>
629
630 Other information stored in the mount structure:
631
632 <ul>
633 <li> the list of child mounts, </li>
634
635 <li> the mount propagation rules, </li>
636
637 <li> the mount flags (<code>MS_RDONLY, MS_NOEXEC, MS_NOATIME</code>,
638 etc.). </li>
639 </ul>
640
641 <p> Since a filesystem can be bind-mounted, there can be several mount
642 structures whose superblock pointers point to the same superblock.
643 The superblock structure contains the UUID, quota information,
644 granularity of timestamps, and much more. </p>
645
646 EXERCISES()
647
648 <ul>
649 <li> Run a command like <code>strace -e%file ls</code> to see the
650 (negative) dentry cache in action. </li>
651
652 <li> Guess which system calls the <code>df(1)</code> command
653 needs to perform to do its work. Confirm by running <code>strace
654 df</code>. </li>
655
656 <li> The <code>mountpoint(1)</code> command determines if a given
657 directory is the mountpoint of a filesystem. Explain how the command
658 is implemented. </li>
659
660 <li> Assume a process opens a file by calling <code>open(2)</code>,
661 then forks so that the child process inherits the file descriptor.
662 Assume further that the child calls <code>lseek(2)</code> to reposition
663 the file offset. Will the parent process see the modified offset? </li>
664
665 <li> Is it safe to run <code>fsck(8)</code> on a filesystem which has
666 been lazily unmounted (by executing <code>mount -l</code>)? </li>
667
668 <li> What is the purpose of the UUID of a filesystem? Why is it
669 generally a good idea to put the UUID instead of the device name as
670 the first field in <code>/etc/fstab</code>? Figure out how to print
671 the UUID of an ext4 and an xfs filesystem. </li>
672
673 <li> When a block device containing a filesystem is cloned with
674 <code>dd(8)</code>, the two UUIDs match. The same is true if the block
675 device was snapshotted with LVM (regardless of whether thin or regular
676 snapshots were used). Discuss the consequences of this fact. </li>
677
678 <li> For a network filesystem like nfs, a dentry can become invalid if
679 the corresponding file was modified on the server or by another
680 client. Discuss the implications of this fact with respect to the
681 dentry cache. </li>
682
683 <li> The <code>umount -a</code> command unmounts all filesystems which
684 are not busy. Describe an algorithm which walks the mount tree to
685 implement the command, taking into account that a mount is busy if
686 it has at least one child mount. </li>
687 </ul>
688
689 HOMEWORK(«
690
691 Describe the concept of a file change notification system and discuss
692 possible use cases. The Linux kernel contains three different file
693 change notification APIs: dnotify, inotify and fsnotify. Explain the
694 difference and describe in one paragraph how applications make use
695 of certain file descriptors to communicate with the kernel in order
696 to track file change events.
697
698 »)
699
700 SECTION(«ext, ext2, ext3, ext4»)
701
702 <p> When the first versions of Linux were released in 1991, the kernel
703 relied on the <em>minix</em> filesystem whose source code could
704 be used freely in education. However, this filesystem was designed
705 for education rather than for real use and had severe limitations.
706 For example it supported only file names up to 14 characters long,
707 and a total size of 64M. </p>
708
709 <p> In 1992 the <em>extended filesystem</em> (ext) was released as the
710 first filesystem that was created specifically for Linux. It already
711 made use of the VFS API that was introduced at the same time. In 1993,
712 ext was superseded by ext2, and later by ext3 (2001) and ext4 (2008).
713 While ext2 is a separate filesystem, the ext3 and ext4 filesystems
714 share the same implementation. </p>
715
716 <p> Over time, many new features were added to ext3 and later to ext4.
717 For example, journaling was one of the main features that ext3 added
718 on top of ext2 while delayed allocation and on-disk data structure
719 checksumming came with ext4. In the remainder of this section we look
720 at the way the ext* filesystems lay out their data and metadata blocks
721 and describe some of the features of ext3 and ext4. </p>
722
723 SUBSECTION(«Block Layout»)
724
725 <p> All ext* filesystems lay out the space of the underlying block
726 device in a traditional way, inspired by the original Unix
727 filesystem, ufs. The available blocks are partitioned into <em>block
728 groups</em>. Each block group is typically 128M large and always
729 contains its own set of tables and bitmaps that manage the blocks of
730 the group. If feasible, the data blocks of each file are kept in the
731 same block group as its inode and the containing directory to avoid
732 unnecessary seeks. As of ext4, block groups can be combined to form
733 larger, so-called <em>flexible</em> block groups to improve metadata
734 locality and to have large files laid out sequentially. </p>
735
736 <p> Inodes are referenced in the <em>inode table</em>, and a bitmap keeps
737 track of allocated and unallocated inodes. A copy of the filesystem
738 superblock is stored in several other block groups since the superblock
739 is critical for the integrity of the filesystem. </p>
740
741 <p> The directory entries of an ext or ext2 filesystem are laid out
742 in the traditional way. That is, the directory names and associated
743 information like file type and the inode number are listed in the data
744 blocks that correspond to the directory. Directories can be imagined
745 as a 3-column table where each line contains an inode number, a file
746 type number, and the path component. Since searching a linear array
747 performs poorly, ext3 implemented B-trees to speed up name lookups
748 in large directories. The nodes of the tree are keyed by the hashes
749 of the names of the directory entries. </p>
750
751 SUBSECTION(«Journaling Modes»)
752
753 <p> If journaling is enabled, metadata updates are always journaled
754 (i.e., a log record is written to the journal first). However, this
755 is not always the case for data blocks. The <code>data</code> mount
756 option for ext3 and ext4 specifies how writeout of data blocks works.
757 The following three journaling modes offer different trade-offs
758 between speed and data integrity. </p>
759
760 <dl>
761 <dt> data=journal </dt>
762
763 <dd> This is the slowest, but also the safest journaling mode. In
764 this mode all data blocks are journaled just like the metadata
765 blocks. Since all data blocks are written twice, this journaling mode
766 has a substantial negative impact on performance. </dd>
767
768 <dt> data=ordered </dt>
769
770 <dd> This is the default value, which offers a good trade-off between
771 speed and data integrity. Ordered means that metadata blocks are only
772 updated after the corresponding data blocks have been written out. In
773 other words, data is written directly to the filesystem (rather than
774 the journal as with <code>data=journal</code>), but the update of the
775 corresponding metadata blocks is deferred until all data blocks have
776 been written. </dd>
777
778 <dt> data=writeback </dt>
779
780 <dd> This is the fastest mode of operation. No ordering between data
781 and metadata blocks is enforced. Filesystem integrity is guaranteed
782 because metadata is still journaled. However, after an unclean
783 shutdown and the subsequent replay of the journal, files which were
784 under writeback at the time of the crash may contain stale data. This
785 happens if the metadata blocks have been updated to report the larger
786 size but the corresponding data did not make it to the disk before
787 the crash. </dd>
788 </dl>
789
790 SUBSECTION(«Extents»)
791
792 <p> The ext2 and ext3 filesystems employ the traditional indirect block
793 scheme, which is basically a table of block numbers which map to the
794 blocks that comprise the contents of the file. One feature of ext4
795 are <em>extent trees</em>, which are a more efficient data structure
796 to describe this mapping because file contents are often laid out
797 sequentially on the underlying block device. In this case, if the file
798 is large it saves metadata space if the block numbers are not stored
799 as a long list but as <em>extents</em>, that is, ranges of successive
800 blocks. Extents have to be organized in a tree to quickly map a given
801 file offset to the block number that contains the file contents at this
802 offset. The first few extents of the extent tree, including its root,
803 are stored in the inode itself. Only files which need more extents
804 require extra metadata blocks for the nodes of the extent tree. </p>
805
806 SUBSECTION(«Growing and Shrinking»)
807
808 <p> Filesystems often need to be grown, but sometimes it is also handy
809 to shrink a filesystem, for example to redistribute the storage between
810 the logical volumes of a volume group of fixed size. A feature which
811 distinguishes ext* from xfs is that ext2, ext3 and ext4 filesystems can
812 be shrunk while xfs filesystems can only be grown. However, to shrink
813 an ext* filesystem, the filesystem must be offline. That's in contrast
814 to online growing, which is supported for both ext* and xfs. </p>
815
816 <p> To shrink a filesystem which is stored on a logical volume,
817 one needs to convey the new size to both the device mapper and the
818 filesystem. It is usually a good idea to run <code>fsck(8)</code> after
819 the filesystem has been unmounted and before <code>resize2fs</code>
820 is run to shrink it. After the filesystem has been shrunk, the next
821 step is to shrink also the underlying LV. Of course the new size of
822 the LV must not be smaller than the size of the shrunk filesystem. To
823 avoid this from happening, for example due to rounding errors,
824 it's best to make the LV slightly larger than necessary and then
825 enlarge the filesystem to the maximal possible size by running
826 <code>resize2fs(8)</code> without specifying the new size. </p>
827
828 EXERCISES()
829
830 <ul>
831 <li> Using default options, create an ext2, ext3 and ext4 filesystem.
832 Compare the three superblocks by examining the output of <code>dump2fs
833 -h &lt;device&gt;</code>. </li>
834
835 <li> Examine the sysfs interface of ext4 which is available through
836 the files below <code>/sys/fs/ext4</code>.
837
838 <li> Consider an application which aims to replace an existing
839 file with the rewrite-and-replace method indicated by the code
840 <a href="«#»broken_rename">below</a> (error checking omitted).
841 Describe the log records which each of the four system calls generates.
842 Explain why the approach is inherently buggy and may result in an
843 empty file <code>foo</code> even if the filesystem is mounted with
844 <code>data=ordered</code> (the default journaling mode). How does
845 the <code>auto_da_alloc</code> mount option of ext4 try to mitigate
846 the chance of data loss? </li>
847
848 <li> Why is data journaling incompatible with encryption? Hint: Commit
849 73b92a2a5e97 in the linux kernel repository. </li>
850
851 <li> Describe a scenario where ext2 is more suitable than ext3 or
852 ext4. </li>
853 </ul>
854
855 HOMEWORK(«
856 <ul>
857 <li> Provide a step-by step procedure to set up an encrypted ext4
858 filesystem. </li>
859
860 <li> Explain how the <em>MMP</em> feature of ext4 helps to protect
861 the filesystem from being multiply mounted. </li>
862
863 </ul>
864 »)
865
866 SECTION(«The Extents Filesystem (xfs)»)
867
868 xfs is a journaling filesystem which was implemented in 1993 for the
869 IRIX operating system and was ported to Linux in 2001. While IRIX was
870 discontinued in 2005, the Linux port of xfs is actively maintained
871 and new features and improvements are added regularly.
872
873 To optimize for performance, xfs departs from the traditional approach
874 that is followed by the ext* family. From the beginning xfs was
875 designed for running on high-end servers where plenty of resources
876 are available to max out even the largest and fastest storage systems,
877 and to perform well under high load when multiple threads access
878 the filesystem concurrently. The implementation relies on B-trees for
879 data, metadata and free space management. xfs is not a COW filesystem,
880 though, and does not do any form of block device management. Tasks like
881 encryption, raid or volume management are left to the corresponding
882 filesystem-agnostic block layer interfaces, for example MD (Software
883 Raid) and LVM (the logical volume manager).
884
885 Unlike the rather static layout of the ext* filesystems, metadata on
886 xfs is dynamically allocated. Consequently, the metadata structures
887 have to be discovered at mount time by walking the filesystem
888 structure. Metadata blocks are self-describing in that each metadata
889 object contains a unique identifier, <em>the log sequence number</em>,
890 which plays the role of a timestamp. CRC32c checksums are used to
891 detect corruption: when the block is read, the CRC32c value is
892 recomputed and checked to verify to integrity of the object.
893
894 SUBSECTION(«Allocation groups»)
895
896 Like the block groups of the ext* family, an xfs filesystem is
897 divided into several "mini filesystems" called <em>Allocation
898 Groups</em> (AGs). This allows xfs to handle operations in parallel,
899 which improves performance if many unrelated processes access the
900 filesystem simultaneously. New directories, are always placed in a
901 different AG than its parent and the inodes of the files in the new
902 directory are clustered around the directory if possible.
903
904 AGs can be up to 1T large, which is much larger than the block groups
905 of the ext* family, but still small enough to allow relative AG
906 pointers to be 32 bits wide rather than 64. Each AG maintains its own
907 superblock and its own set of B-trees for resource management. Files
908 and directories are allowed to span multiple AGs, so the AG size does
909 not limit the maximal file size.
910
911 The first AG is the <em>primary</em> AG. Its superblock is special
912 in that it stores the accumulated counters of all AGs. The secondary
913 superblocks are only consulted by <code>xfs_repair(8)</code>.
914
915 SUBSECTION(«Project Quota Implementation»)
916
917 <p> Project quotas used to be an xfs feature, but the functionality has
918 been made generic and is therefore available to other filesystems as
919 well. Besides xfs, also ext4 supports project quotas. </p>
920
921 <p> To limit the size of an arbitrary subtree, a special inode flag,
922 <code>XFS_DIFLAG_PROJINHERIT</code>, is used. This flag indicates
923 that the directory and all inodes created in the directory inherit
924 the project ID of the directory. Hence the act of creating a file
925 in a <code>XFS_DIFLAG_PROJINHERIT</code> marked directory associates
926 the new file with s a specific project ID. New directories also get
927 marked with <code>XFS_DIFLAG_PROJINHERIT</code> so the behaviour is
928 propagated down the directory tree. </p>
929
930 <p> Project quota is accounted for when moving <em>into</em> an accounted
931 directory tree, but not when moving out of a directory tree into
932 an unaccounted location. Moreover, one can create hard links to an
933 accounted file in an uncontrolled destination (as the inode is still
934 accounted). But it is not allowed to link from an accounted directory
935 into a destination with a different project ID. </p>
936
937 <p> Project IDs may be mapped to names through the
938 <code>/etc/projid</code> and <code>/etc/projects</code> configuration
939 files. </p>
940
941 SUBSECTION(«Speculative Preallocation»)
942
943 <p> As files are being written, xfs allocates extra blocks beyond the
944 current end of file, anticipating that further writes will arrive to
945 extend the file. The preallocation size is dynamic and depends mainly
946 on the size of the file. When the file is closed, the unneeded extra
947 blocks are reclaimed. </p>
948
949 <p> The speculatively preallocated post-EOF blocks help to minimize
950 file fragmentation, but they can cause confusion because they are
951 accounted identically to other blocks, making files appear to use
952 more data blocks than expected. </p>
953
954 SUBSECTION(«Reverse Mapping»)
955
956 <p> This feature was implemented in 2018. It adds yet another B-tree to
957 the xfs on-disk data structures: the <em> reverse mapping tree</em>,
958 which allows the filesystem to look up the owner of a given block
959 (if any). For example, if the underlying storage device reports that
960 a certain block went bad, and that block happens to contain contents
961 of a regular file, the reverse mapping tree yields the corresponding
962 inode and the file offset. </p>
963
964 <p> Another use of the reverse mapping tree is <em>filesystem
965 scrubbing</em>, a data integrity technique where a kernel thread
966 runs in the background to check the on-disk data structures for
967 consistency while the filesystem is mounted. </p>
968
969 <p> Since reverse mapping imposes some performance overhead, the
970 feature is disabled by default. </p>
971
972 SUBSECTION(«mkfs Options»)
973
974 Generally, the default settings are suitable for most workloads,
975 so there is usually no need for manual optimization. Nevertheless,
976 many xfs parameters can be tweaked at filesystem creation time. The
977 following list describes some options to <code>mkfs(8)</code>.
978
979 <ul>
980 <li> AG count. The optimal number of AGs for a given block device
981 depends on the underlying storage. Since a single device has
982 limited IO concurrency capability while raid devices allow for
983 much more concurrency, the number of AGs for the single device
984 should be smaller. <code>mkfs.xfs(8)</code> tries to figure out the
985 characteristics of the block device and pick a suitable number of
986 AGs. However, in some cases, for example if the filesystem is stored
987 on a hardware raid array or on a bcache device, it can not detect the
988 geometry of the underlying storage. Explicitly specifying the number
989 of AGs is probably a good idea in this case. As a rule of thumb,
990 32 AGs should be enough for most cases.
991
992 <p> The number of AGs can not be changed after the filesystem has been
993 created. However, when an xfs filesystem is grown, new AGs are added,
994 and care should be taken to align the device size to a multiple of
995 the AG size. Generally, one should not increase the filesystem many
996 times in small steps. </p> </li>
997
998 <li> Stripe unit and stripe width. <code>mkfs.xfs(8)</code> tries
999 hard to choose reasonable defaults, but for hardware raid arrays,
1000 the command has no way to tell the raid level and the number of
1001 data disks in the array. Without this number it is beneficial to
1002 specify the <code>sunit</code> and <code>swidth</code> options to
1003 <code>mkfs.xfs(8)</code>. </li>
1004 </ul>
1005
1006 EXERCISES()
1007
1008 <ul>
1009 <li> Run <code>df -h $P</code> and <code>df -hi $P</code>, where
1010 <code>P</code> is a path on which project quotas are enforced. </li>
1011
1012 <li> Create two directories and set up project quotas for both, using
1013 different project IDs. Guess what happens if one tries to hard link
1014 two files between the two directories. Verify by running a suitable
1015 <code>ln</code> command. </li>
1016
1017 <li> Run <code>xfs_bmap -v file</code> for a large file on an xfs
1018 filesystem to see the extents of the file. </li>
1019
1020 <li> Run <code>xfs_logprint -t</code> to dump the log records of a
1021 busy xfs filesystem. </li>
1022
1023 <li> Run <code>xfs_info(8)</code> on an xfs mountpoint and examine
1024 the values shown in the data section.</li>
1025
1026 <li> Which of the three journaling modes of ext4 (if any) corresponds
1027 to the journaling mode of xfs? </li>
1028
1029 <li> Compile the xfsprogs and xfstests packages. Run xfstests on an
1030 empty xfs filesystem. </li>
1031
1032 <li> Run <code>xfs_info(8)</code> on an existing xfs filesystem and
1033 determine whether the device size is an integer multiple of the AG
1034 size. Discuss the relevance of this property. </li>
1035
1036 <li> Assume you'd like to create a ~100T large xfs filesystem on
1037 a logical volume so that the device size is an integer multiple
1038 of the AG size. Come up with suitable <code>lvcreate</code> and
1039 <code>mkfs.xfs(8)</code> commands to achieve this. </li>
1040
1041 <li> Given a path to a file on an xfs filesystem that is mounted with
1042 project quotas enabled, how can one determine the project ID of the
1043 file? </li>
1044 </ul>
1045
1046 HOMEWORK(«
1047
1048 Summarize how the reflink feature is implemented in xfs.
1049
1050 »)
1051
1052 HOMEWORK(«
1053
1054 Explain how xfs metadumps work and which parts of the filesystem are
1055 included in the dump. Provide a formula to estimate the expected
1056 size of the dump, given the outputs of <code>xfs_info(8)</code>
1057 and <code>df(1)</code>.
1058
1059 »)
1060
1061 SECTION(«The Network Filesystem (nfs)»)
1062
1063 The nfs service allows computers to mount a directory located on
1064 a remote server as if it were a local disk, allowing file sharing
1065 over a (typically local) network. On Linux, both the server and the
1066 client are part of the operating system kernel, but there are also
1067 nfs implementations which operate in user space. The nfs protocol
1068 is an open specification, available as a set of RFCs, that has been
1069 implemented on various operating systems including Linux, FreeBSD
1070 and MacOS. Server and client can run different operating systems as
1071 long as they both support a common nfs protocol version.
1072
1073 The original nfs protocol was designed in 1984 by Sun Microsystems for
1074 the Sun operating system (SunOS). This version was never released and
1075 was only deployed inside the company. Protocol version 2 was released
1076 in 1989. It is long obsolete due to its severe limitations, for example
1077 it had a maximal file size of 2G. Its successor, protocol version 3,
1078 was released in 1995 and fixed most of the limitations. This version
1079 is still in use, although nfs protocol version 4 (called nfs4 in what
1080 follows) is most frequently deployed these days. It was released
1081 in 2000 and contains several performance, robustness and security
1082 improvements over the older versions. The authorative resource for
1083 the gory details of nfs4 is RFC 7530. The nfs protocol is still under
1084 active development. Protocol version 4.1 was released in 2010 and
1085 version 4.2 followed in 2016.
1086
1087 SUBSECTION(«rpc and xdr»)
1088
1089 <p> The nfs protocols are built on top of a concept called <em>remote
1090 procedure call</em> (rpc), which is based on an encoding format known
1091 as <em>external data representation</em> (xdr). The rpcs which are
1092 provided by the nfs server are closely related to filesystem-specific
1093 system calls like <code>read(2), write(2), link(2), rename(2),
1094 mkdir(2)</code> etc. Therefore an introduction to nfs naturally starts
1095 with rpc and xdr. </p>
1096
1097 <p> The functionality of a network service can often be divided into
1098 the low-level part and the application-level part. The low-level
1099 part talks to the kernel to establish the connection and to send
1100 and receive data, using system calls like <code>socket(2), bind(2),
1101 listen(2), connect(2), recv(2), send(2)</code>, etc. This part is
1102 independent of the application layer which is only concerned with
1103 the network protocol of the service. For a service like nfs which
1104 combines more than one network protocol, it makes sense to abstract
1105 out the common low-level part. The rpc framework was designed in 1976
1106 to provide such an abstraction. It supports a variety of transports
1107 including tcp and udp. With rpc, a program running on one computer can
1108 execute a function on a different computer. The functions that can
1109 be called in this manner, the <em>rpc services</em>, are identified
1110 by a program number and the version number. Originally developed by
1111 Sun Microsystems in the 1980s, rpc is still in use today, sometimes
1112 still under the old "sunrpc" name. </p>
1113
1114 <p> In general, the called procedure runs on a different system as the
1115 calling procedure, so the client and server processes don't share the
1116 same address space, and no memory references can be passed. Instead,
1117 data structures must be <em>serialized</em> first, i.e. converted to a
1118 certain transfer format that can be stored in a single memory buffer,
1119 the xdr buffer, which is then sent to the server. The received xdr
1120 buffer is <em>de-serialized</em> (decoded) by the server, possibly
1121 in a different way. For example, the server might store the bytes
1122 which describe an integer value in a different order than the client
1123 to meet the requirements of its CPU (little/big endian). The xdr API
1124 offers routines to convert many predefined data types (int, string,
1125 etc.) to an xdr buffer or vice versa. This unburdens the programmer
1126 from such details as much as possible. </p>
1127
1128 <p> To activate rpc on a system, the <code>rpcbind(8)</code> daemon
1129 (formerly known as portmapper) must be running. This daemon manages
1130 the various procedures employed by nfs such as mount, lock manager,
1131 quota daemon, and the nfs procedure itself. It communicates with
1132 rpc clients by listening on a well-known port. Clients first send a
1133 <code>get_port</code> request to <code>rpcbind(8)</code> in order to
1134 find out the port number which corresponds to the procedure they are
1135 interested in. For example, an nfs client which intends to mount an
1136 nfs-exported directory requests the port number of the mount procedure
1137 from <code>rpcbind(8)</code>. A second request is then made to actually
1138 mount the filesystem. The exercises of this section ask the reader to
1139 run the <code>rpcinfo(8)</code> tool to show the available procedures
1140 and their port numbers on the specified server. </p>
1141
1142 <p> The input format for rpc is the <em>rpc language</em> (rpcl),
1143 which is similar to C. This format fully describes the application
1144 protocol, including all procedures and data types. RFC 7531 contains
1145 the full xdr description of nfs4 in rpcl. The <code>rpcgen(1)</code>
1146 protocol compiler generates C code from rpcl input. The C code is
1147 then compiled to generate application code which implements the
1148 protocol described by the input. <code>rpcgen(8)</code> offers
1149 multiple application interfaces which provide different degrees of
1150 control over the rpc internals. </p>
1151
1152 SUBSECTION(«Stateless and Stateful Protocols»)
1153
1154 <p> The nfs protocol versions 2 and 3 are <em>stateless</em>, which
1155 means that that by design the server does not keep track of what
1156 clients do. For example, the server does not remember which files
1157 are currently open. Instead, the client tracks open files and the
1158 current offset of each open file, translating application requests
1159 into suitable protocol messages. While statelessness simplifies crash
1160 recovery for both the client and the server, it also has downsides. For
1161 example, file locking requires the server to maintain the existing
1162 locks and the list of clients which are holding them. Since this
1163 can not be done with a stateless protocol, another rpc service,
1164 the <em>lock daemon</em> (lockd), was added. To recover the state of
1165 locks after a server reboot, yet another rpc service, the <em>status
1166 daemon</em> (statd), had to be introduced. This design added complexity
1167 for no real gain, which is why nfs4 departed from the previous versions
1168 by introducing state. With a stateful protocol it became possible to
1169 combine all related rpc services into a single service which uses a
1170 single TCP port. This simplifies the implementation and also allows
1171 for <em>compound operations</em> where the client sends more than
1172 one request in a singe rpc call. </p>
1173
1174 SUBSECTION(«Identifiers and File Handles»)
1175
1176 <p> File handles describe the file or directory a particular operation
1177 is going to operate upon. For the nfs clients, file handles are
1178 opaque blobs that can only be tested for equality, but which can not
1179 be interpreted in any way. However, for the nfs server a file handle
1180 identifies the corresponding file or directory. Most protocol requests
1181 include a file handle. For example, the LOOKUP and MKDIR operations
1182 both return a file handle to the nfs client. </p>
1183
1184 <p> A file handle consists of three identifiers: a filesystem ID,
1185 an inode number and the so-called <em>generation number</em>. The
1186 filesystem ID and the inode number also exist for local files. They
1187 are derived from the <code>statvfs</code> structure that describes
1188 the exported filesystem and the <code>stat</code> structure of the
1189 inode, respectively. The generation number, however, is only needed
1190 for network file systems. Roughly speaking, the generation number
1191 counts how many times the inode has been re-used. This is necessary to
1192 prevent clients from accessing a file through an existing file handle
1193 if the file was deleted on the server and its inode number has been
1194 re-used subsequently. File handles are based on <em>leases</em>:
1195 The client periodically talks to the server to update its leases. </p>
1196
1197 <p> There is a deep interaction between file handles and the dentry
1198 cache of the vfs. Without nfs, a filesystems can rely on the following
1199 "closure" property: For any positive dentry, all its parent directories
1200 are also positive dentries. This is no longer true if a filesystem
1201 is exported. Therefore the filesystem maps any file handles sent to
1202 nfs clients to <em>disconnected</em> dentries. Any process whose cwd
1203 is on a local fs contributes to the reference counter of the dentry
1204 that corresponds to the directory, and thus prevents the filesystem
1205 from being unmounted. For nfs, this is not possible. More general:
1206 remote applications need a way to refer to a particular dentry,
1207 stable across renames, truncates, and server-reboot. </p>
1208
1209 SUBSECTION(«Attribute Caching»)
1210
1211 <p> Several rpcs return file <em>attributes</em>, i.e., the inode
1212 information which is available for local filesystems through the
1213 <code>stat(2)</code> system call. For example, the <code>LOOKUP</code>
1214 rpc returns a new file handle and the attributes that are associated
1215 with the given path, and the <code>GETATTR</code> rpc returns the
1216 current attributes of the file which corresponds to an existing
1217 file handle. By default, nfs clients cache these metadata. However,
1218 since metadata can change at any time due to file operations from other
1219 clients, the cached information can become stale. Therefore attributes
1220 are cached for only a few seconds, which is thus the duration of the
1221 time window during which metadata modifications caused on a different
1222 nfs client remain undetected. Reducing this time window can result in
1223 flooding the server with <code>GETATTR</code> requests while extending
1224 it increases the chance of returning stale cached data or metadata
1225 to the application. With the <code>noac</code> mount option, the
1226 client asks the server every time it needs to assess file metadata.
1227 However, the option also prohibits <em>data</em> caching, just like
1228 the <code>sync</code> option. This severely impacts performance. </p>
1229
1230 <p> Changes to directories are handled similarly. To detect when
1231 directory entries have been added or removed on the server, the
1232 client watches the directory mtime (nfsv2 and nfsv3) or <em>change
1233 attribute</em> (nfsv4). When the client detects a change, it drops
1234 all cached attributes for that directory. Since the directory's mtime
1235 and the change attributes are cached attributes, it may take some
1236 time before a client notices directory changes. </p>
1237
1238 SUBSECTION(«Data Caching and Cache Consistency»)
1239
1240 <p> nfs clients are usually allowed to cache write operations
1241 because the write caches increase client performance significantly
1242 and reduce the load of the server at the same time, allowing it to
1243 support more clients. However, one side effect of write caching is
1244 that other clients which access the same file at the same time will
1245 not see the changes immediately. The <em>consistency guarantees</em>
1246 of a network file system describe the semantics of such concurrent
1247 file operations. </p>
1248
1249 define(«caco_height», «300»)
1250 define(«caco_width», «100»)
1251 define(«caco_margin», «10»)
1252 dnl: args: y-pos, client-no, text
1253 define(«caco_text», «
1254 <text
1255 x="12"
1256 y="eval($1 * eval((caco_height() - caco_margin()) / 7)
1257 + 2 * caco_margin())"
1258 ifelse(«$2», «1», «stroke="#228" fill="#228"»)
1259 ifelse(«$2», «2», «stroke="#822" fill="#822"»)
1260 ifelse(«$2», «3», «stroke="#282" fill="#282"»)
1261 font-size="20"
1262 >$3</text>
1263 »)
1264 <div>
1265 <svg
1266 width="caco_width()"
1267 height="caco_height()"
1268 xmlns="http://www.w3.org/2000/svg"
1269 xmlns:xlink="http://www.w3.org/1999/xlink"
1270 >
1271 <path
1272 stroke-width="3"
1273 stroke="black"
1274 d="
1275 M 5 1
1276 l 0 eval(caco_height() - caco_margin())
1277 l 3 0
1278 l -3 5
1279 l -3 -5
1280 l 3 0
1281 "
1282 />
1283 caco_text(«0», «1», «open»)
1284 caco_text(«1», «1», «write»)
1285 caco_text(«2», «2», «open»)
1286 caco_text(«3», «1», «close»)
1287 caco_text(«4», «3», «open»)
1288 caco_text(«5», «2», «read»)
1289 caco_text(«6», «3», «read»)
1290 </svg>
1291 </div>
1292
1293 <p> The nfs versions 2 and 3 provide <em>weak cache consistency</em>
1294 which notifies clients about changes made by other clients before
1295 and after an rpc. This concept turned out to be problematic, so
1296 nfsv4 replaced weak cache consistency by <em>close-to-open cache
1297 consistency</em>, which means that an nfs client is only guaranteed
1298 to see the effects of another client's write operation if it opens
1299 the file <em>after</em> the client that wrote to the file has closed
1300 it. </p>
1301
1302 <p> To illustrate close-to-open cache consistency, consider the
1303 scenario illustrated in the diagram on the left where three nfs clients
1304 (as indicated by colors) access the same file. The blue client opens
1305 the file and writes to it while the other two clients only perform read
1306 operations. With close-to-open cache consistency the green client is
1307 guaranteed to see the write operation of the blue client while there
1308 is no such guarantee for the red client. </p>
1309
1310 SUBSECTION(«Delegations»)
1311
1312 nfs4 introduced a feature called <em>file delegation</em>. A file
1313 delegation allows the client to treat a file temporarily as if no
1314 other client is accessing it. Once a file has been delegated to a
1315 client, the client might cache all write operations for this file,
1316 and only contact the server when memory pressure forces the client
1317 to free memory by writing back file contents. The server notifies
1318 the client if another client attempts to access that file.
1319
1320 SUBSECTION(«Silly Renames and Stale File Handles»)
1321
1322 <p> Many applications employ the following old trick to store temporary
1323 data without leaving a stale temp file behind in case the process
1324 crashes or is killed with <code>SIGKILL</code>. They create and open
1325 a temporary file, then call <code>unlink(2)</code> to disassociate
1326 the path from the filesystem tree while retaining the file descriptor
1327 for subsequent I/O operations. </p>
1328
1329 <p> With NFS this does not work because the file descriptor exists
1330 only on the client, and the server doesn't know about it. Consequently
1331 the normal <code>unlink(2)</code> call on the server would delete
1332 the file and free its data blocks. This is why the nfs client just
1333 <em>renames</em> the file to something like <code>.nfs12345</code>
1334 if an application calls <code>unlink(2)</code> to remove it while it
1335 is still open. Only after all the last file descriptor that refers
1336 to the thusly silly-renamed file is closed, the client removes the
1337 file by issuing an appropriate rpc. </p>
1338
1339 <p> This approach is not perfect. For one, if the client crashes,
1340 a stale <code>.nfs12345</code> file remains on the server. Second,
1341 since silly renames are only known to the nfs client, bad things
1342 happen if a different client removes the file. </p>
1343
1344
1345 <p> The file handle which an nfs client received through some earlier
1346 rpc can become invalid at any time due to operations on a different
1347 hosts. This happens, for example, if the file was deleted on the server
1348 or on a different nfs client, or when the directory that contains
1349 the file is no longer exported by the server due to a configuration
1350 change. Subsequent attempts to use this file handle for rpcs then
1351 fail with the <code>ESTALE</code> error. </p>
1352
1353 <p> The exercises below ask the reader to cause silly-renamed files, and
1354 stale file handles. </p>
1355
1356 SUBSECTION(«Performance Tuning»)
1357
1358 There are plenty of mount options for nfs. See <code>nfs(5)</code>
1359 for details. We only cover a couple of the more interesting ones with
1360 respect to performance.
1361
1362 <ul>
1363 <li> <code>soft/hard</code>. hard: nfs requests are retried
1364 indefinitely. Soft: requests are failed eventually. </li>
1365
1366 <li> <code>sync/async</code>. async: nfs client delays sending
1367 application writes to the server. sync: writes cause data to be
1368 flushed to the server before the system call returns. </li>
1369
1370 </ul>
1371
1372 EXERCISES()
1373
1374 <ul>
1375 <li> Run the following commands on an nfs client and discuss the
1376 output: <code>df -h</code>, <code>mount -t nfs -t nfs4</code>. </li>
1377
1378 <li> Run <code>/usr/sbin/rpcinfo -b 100003 3 | awk '{print $2}' |
1379 sort | uniq</code> to list all NFS servers. </li>
1380
1381 <li> Explain the difference between the <code>sync</code> mount option
1382 for nfs and the <code>sync</code> export option of nfsd. </li>
1383
1384 <li> Run the following commands on an nfs server and discuss the
1385 output: <code>/sbin/showmount -e</code>, <code>rpcinfo</code>. </li>
1386
1387 <li> On an nfs server, run <code>collectl -s F -i 5</code> and discuss
1388 the output. </li>
1389
1390 <li> In an nfs-mounted directory, run <code>cat > foo &</code>. Note
1391 that the cat process automatically receives the STOP signal.
1392 Run <code>rm foo; ls -ltra</code>. Read section D2 of the
1393 <a href="http://nfs.sourceforge.net/">nfs HOWTO</a> for the
1394 explanation. </li>
1395
1396 <li> In an nfs-mounted directory, run <code>{ while :; do echo; sleep
1397 1; done; } > baz &</code>. What happens if you remove the file on a
1398 <em>different</em> nfs client? </li>
1399
1400 <li> Discuss the pros and cons of hard vs. soft mounts. </li>
1401
1402 <li> Read section A10 of the <a href="http://nfs.sourceforge.net/">nfs
1403 HOWTO</a> to learn about common reasons for stale nfs handles. </li>
1404
1405 <li> Can every local filesystem be exported via nfs? </li>
1406
1407 <li> What's that readdirplus thing? Describe a scenario where it is
1408 beneficial and another scenario where it hurts performance. </li>
1409 </ul>
1410
1411 HOMEWORK(«
1412
1413 For each POSIX lock type, discuss whether file locks of this type
1414 work on an nfs-mounted filesystem. If they do, discuss the relevant
1415 mount options that are necessary to make them work.
1416
1417 »)
1418
1419 HOMEWORK(«
1420
1421 <ul>
1422 <li> Use the simplest interface that rpcgen has to offer to write a
1423 protocol in rpcl which allows a client to retrieve the load value of
1424 the server as a string (contents of <code>/proc/loadavg</code>). </li>
1425
1426 <li> Write the same program, but this time use the expert interface
1427 and and pass each value of <code>/proc/loadavg</code> as a number of
1428 suitable type. </li>
1429 </ul>
1430
1431 »)
1432
1433 HOMEWORK(«
1434
1435 Describe the purpose of the <code>nfsd</code> and the
1436 <code>rpc_pipefs</code> pseudo filesystems. Hint: See
1437 <code>Documentation/filesystems/nfs/nfs.txt</code> of the Linux
1438 source code.
1439
1440 »)
1441
1442 HOMEWORK(«
1443
1444 Provide an overview of nfs version 4.1 (Parallel nfs).
1445
1446 »)
1447
1448 SUPPLEMENTS()
1449
1450 SUBSECTION(«Broken Rename»)
1451
1452 <pre>
1453 fd = open("foo.new", ...);
1454 write(fd, "new content", ...);
1455 close(fd);
1456 rename("foo.new", "foo");
1457 </pre>
1458
1459 SECTION(«Further Reading»)
1460 <ul>
1461 <li> <code>Documentation/filesystems/vfs.txt</code> of the Linux
1462 kernel source. </li>
1463 <li> Dominic Giampaolo: Practical File System Design </li>
1464 <li> Cormen </li>
1465 <li> Darrick Wong: XFS Filesystem Disk Structures </li>
1466 <li> The <a href="https://xfs.org/index.php/Main_Page">xfs FAQ</a> </li>
1467 <li> Documentation/filesystems/path-lookup.rst </li>
1468 <li> rfc 5531: Remote Procedure Call Protocol, Version 2 (2009) </li>
1469 <li> Birell, A.D. and Nelson, B.J.: Implementing Remote Procedure Calls
1470 (1984) </li>
1471 </ul>