Filesystems.m4

   1 TITLE(«
   2
   3         Happy filesystems are all alike, but every corrupted filesystem is
   4         unhappy in its own way. -- Jonathan Corbet (2011)
   5
   6 », __file__)
   7
   8 OVERVIEW(«
   9
  10         The first part of this chapter covers general concepts related to
  11         filesystems. This part is largely independent of the underlying
  12         operating system. The later sections look at some aspects of two
  13         popular local local filesystems for Linux: ext4 and xfs. The last
  14         section contains selected topics related to the network filesystem,
  15         nfs.
  16 »)
  17
  18 SECTION(«Introduction»)
  19
  20 <p> Every Unix system has at least one filesystem which contains
  21 the root of the tree of files. This filesystem, the <em>root file
  22 system</em>, is normally stored on a local block device, for example on
  23 an SSD. On top of that, several additional filesystems of different
  24 types are mounted usually. For example, most Linux distributions
  25 mount the proc and sys pseudo filesystems and several instances of
  26 tmpfs. Other filesystems may be mounted as needed. </p>
  27
  28 SUBSECTION(«Classification»)
  29
  30 <p> The Linux kernel supports several dozens of different filesystems
  31 and new ones are added frequently. Some filesystems are only employed
  32 for special purpose computers while others are in use on almost all
  33 systems. The <code>/proc/filesystems</code> pseudo file contains the
  34 list of supported filesystems. We don't aim to provide a full listing
  35 but classify filesystems as belonging to exactly one of the following
  36 categories.  </p>
  37
  38 <dl>
  39         <dt> local </dt>
  40
  41         <dd> The filesystem is stored on a local block device, and only the
  42         local computer can access it. Examples: ext4, xfs, fat. </dd>
  43
  44         <dt> pseudo </dt>
  45
  46         <dd> These filesystems are characterized by the absence of
  47         backing storage. That is, there is no block device which stores the
  48         contents. Instead, contents exist only in memory, or are provided on
  49         demand (i.e., when the files are accessed). When the filesystem is
  50         unmounted, all data is lost. Examples: tmpfs, sysfs, proc. </dd>
  51
  52         <dt> network </dt>
  53
  54         <dd> This type of filesystem makes files which are physically stored
  55         on a different computer visible on the local computer. Examples: nfs,
  56         cifs. </dd>
  57
  58         <dt> fuse (filesystem in user space) </dt>
  59
  60         <dd> Contents are provided by a user space application. Examples:
  61         sshfs. </dd>
  62
  63         <dt> distributed </dt>
  64
  65         <dd> The contents are stored on more than one computer. Examples:
  66         glusterfs, lustre, nfs-4.1. </dd>
  67
  68 </dl>
  69
  70 SUBSECTION(«POSIX Filesystems»)
  71
  72 <p> Regardless of the category, most native Unix filesystems support
  73 the semantics prescribed by the POSIX.1-2008 standard. These include
  74 <code>open(2), read(2), write(2), stat(2)</code> system calls and
  75 many more. In particular, a POSIX filesystem must store in each
  76 file's metadata the user and group ID of the owner and the usual
  77 three timestamps. For compatibility reasons this is not possible for
  78 "foreign" filesystems like Microsoft's FAT which was designed for
  79 the single-user DOS operating system and thus has no concept of file
  80 ownership. </p>
  81
  82 SUBSECTION(«User, Group, Directory and Project Quotas»)
  83
  84 <p> Early Unix systems already supported user and group quotas while
  85 directory and project quotas are comparatively new concepts. User and
  86 group quotas impose an upper bound on the files owned by a specific
  87 user or Unix group, respectively. Directory quotas restrict the size
  88 of an accounted directory. Project quotas are similar to directory
  89 quotas but are not restricted to single directories. They are realized
  90 as an aggregation of unrelated inodes with a specific identifier,
  91 the <em>project ID</em>, which is stored in each accounted inode.
  92 It is possible for arbitrary files to have the same project ID and
  93 hence be accounted to the same project. The project ID is independent
  94 from the UID and the GID, hence project accounting is independent of
  95 user and group accounting. </p>
  96
  97 <p> For each quota type there are two configurable limits: the
  98 <em>inode limit</em> which imposes a bound on the number of files
  99 and directories owned by a specific user/group/project ID, and the
 100 <em>block limit</em> which bounds the space that the same set of files
 101 is permitted to occupy.  Each limit is in fact a pair of limits: the
 102 <em>soft limit</em> and the <em>hard limit</em>. When utilization
 103 reaches the soft limit, a message is sent but no further action is
 104 taken. Only if the hard limit is reached, subsequent attempts to
 105 request more resources fail with the <code>EDQUOT</code> (<code>Disk
 106 quota exceeded</code>) error. </p>
 107
 108 SECTION(«Filesystem Design»)
 109
 110 In this section we take a closer look at the data structures of local
 111 filesystems, their on-disk layout, and some of the techniques that
 112 make filesystems fast and robust.
 113
 114 SUBSECTION(«Superblocks»)
 115
 116 <p> The <em>superblock</em> describes the <em>geometry</em> of the
 117 file system. Only one superblock is needed for normal operation but
 118 most local filesystems store backup copies of the superblock. These
 119 extra superblocks are only accessed during recovery from filesystem
 120 corruption, for example if the main superblock was overwritten by
 121 accident. Although the details vary between filesystem types, the
 122 following information is typically stored in the superblock: </p>
 123
 124 <ul>
 125         <li> The <em>magic number</em>, or <em>signature</em> which identifies
 126         the block device as containing a filesystem of a certain type. For
 127         example, the magic number for ext2 is 0xef534. Newer filesystems use
 128         larger signatures. </li>
 129
 130         <li> The size of the filesystem. </li>
 131
 132         <li> The last mount time.
 133
 134         <li> A universally unique identifier (UUID) which is created randomly
 135         at filesystem creation time. </li>
 136
 137         <li> Utilization counts like the number of free blocks. </li>
 138 </ul>
 139
 140 <p> The blkid library, libblkid, contains a database of filesystems and
 141 of other software which stores its metadata in a specific superblock
 142 of a block device. This includes filesystems, swap devices, physical
 143 volumes for software raid or the logical volume manager. The library
 144 enables applications to identify the contents of on a block device
 145 and to extract additional information like the UUID.  There are a
 146 number of tools which are linked against this library to examine
 147 or modify the superblocks: <code>lsblk(8)</code> to list block
 148 devices, <code>blkid(8)</code> to print block device properties, and
 149 <code>wipefs(8)</code> to overwrite (or merely print) all superblocks
 150 of a given block device. Also the <code>mount(8)</code> executable is
 151 usually linked against libblkid to support mounting by UUID instead
 152 of device name because device names like <code>/dev/sda</code> might
 153 change across reboots. </p>
 154
 155 SUBSECTION(«B-trees and Variants»)
 156
 157 <p> Most filesystems, including the ext4 and xfs filesystems described
 158 in dedicated sections, employ some B-tree variant to manage their
 159 data blocks. This is reason enough to take a closer look at the key
 160 features of this ubiquitous data structure. We won't go into detail,
 161 though. </p>
 162
 163 <p> B-trees were invented in the 1970s by Rudolf Bayer and Ed McCreight
 164 as a data structure for an algorithm that can quickly access a random
 165 block in a particular file stored on on a rotating disk. The parameters
 166 can be tuned to minimize the number of disk operations performed,
 167 i.e., the height of the tree. </p>
 168
 169 <p> It is unclear what the "B" in "B-tree" actually means. It certainly
 170 does not mean "binary", because B-trees typically have a high fanout,
 171 so nodes have way more than two children (which, by definition,
 172 is the maximal number of child nodes for a binary tree). The typical
 173 fanout values for filesystems range from 100 to 1000. </p>
 174
 175 <p> B-trees impose a fixed lower bound on the number of child nodes, and
 176 an upper bound that is twice as large as the lower bound. The upper
 177 bound is called the <em>order</em> of the tree. These bounds imply
 178 an upper bound for the maximal height of the tree, given the number
 179 of leaf nodes. Hence a B-tree is always well balanced, lookup times
 180 are always optimal (i.e., logarithmic), and storage utilization is
 181 at least 50%. Unlike a hash table approach there is no decrease in
 182 performance when utilization approaches 100%. </p>
 183
 184 <p> Addition and removal of nodes is performed in a way that keeps
 185 the tree balanced. For example, if a node is being removed and this
 186 results in a violation of the lower bound on the number of child nodes
 187 of the parent node, nodes are moved between siblings, or siblings
 188 are merged. </p>
 189
 190 <p> A node with <em>k</em> children always has <em>k - 1</em> keys.
 191 For filesystems, the keys can be block numbers, hashes, directory
 192 entries, or the size of block ranges. In any case, the keys act as
 193 separators which divide the child nodes. For example, if the node
 194 contains 3 child nodes and the two keys 23 and 42, then the left child
 195 node contains keys less than 23, the middle child node contains keys
 196 between 23 and 42, and the right child node contains keys greater
 197 than 42. </p>
 198
 199 <p> Many filesystems, including xfs, do not use the classic B-tree
 200 structure outlined above but its variant called <em>B+tree</em>. The
 201 main differences between a B-tree and a B+tree are (a) data records
 202 are stored only in leaf nodes, and (b) leaves are linked together so
 203 that all data records can be traversed in-order by following sibling
 204 pointers. This little detail has far-reaching consequences. Roughly
 205 speaking, the sibling pointers prohibit copy on write for metadata
 206 blocks. </p>
 207
 208 SUBSECTION(«Journaling»)
 209
 210 <p> Single filesystem operations often need to update multiple
 211 blocks. For example, adding a new file to a directory requires three
 212 block updates: the data block which contains the new file's contents,
 213 the metadata block which contains the directory entries, and the
 214 on-disk structures that manage the free blocks. Regardless of the
 215 order in which these updates are performed, an unclean shutdown due
 216 to a power outage or a system crash leads to a corrupt filesystem if
 217 the shutdown happens after the first but before the third update was
 218 performed. Journaling is a capability which avoids this situation by
 219 making multiple block updates atomic, thereby ensuring consistency
 220 after an unclean shutdown. </p>
 221
 222 <p> One of the first journaling filesystems was jfs, introduced 1990 by
 223 IBM for the AIX operating system. The first journaling filesystem for
 224 Linux was reiserfs version 3, which was included in the Linux kernel
 225 in 2001. In the same year Linux gained support for three additional
 226 journaling filesystems: jfs and xfs were ported to Linux from AIX
 227 and IRIX, respectively, and the first stable version of ext3 was
 228 released. </p>
 229
 230 <p> Journaling filesystems retain consistency after an unclean shutdown by
 231 keeping a <em>journal</em> (also known as <em>log</em>) of operations
 232 being performed. The journal is either stored on an separate device or
 233 in a reserved area within the filesystem. The entries of the journal,
 234 the <em>log records</em>, describe <em>transactions</em>, which are
 235 filesystem operations that must be performed atomically. At the next
 236 mount after an unclean shutdown, the journal is <em>replayed</em>,
 237 that is, the recorded transactions are reapplied. The time required
 238 to replay the journal depends only on the size of the journal and
 239 the number of log records, but not on the size of the filesystem
 240 or the number of files. It usually takes only a couple of seconds,
 241 which is considerably faster than a filesystem check/repair run, which
 242 can take hours. </p>
 243
 244 <p> Although a journaling filesystem writes metadata twice, this can
 245 actually <em>increase</em> performance because metadata writes to the
 246 journal are sequential, and when the log entries are committed, writes
 247 can be combined and reordered, which is a win not only for rotating
 248 disks. Since data integrity is usually less important than filesystem
 249 integrity, only metadata (inodes, directory contents) is journaled
 250 by default while data blocks (file contents) are written directly. </p>
 251
 252 SUBSECTION(«Delayed Allocation»)
 253
 254 <p> This term refers to the technique of deferring the decision of
 255 which blocks to allocate until the last possible moment, when blocks
 256 have to be written out due to memory pressure or an explicit sync
 257 request from user space. This technique is employed by xfs and ext4
 258 but not by earlier versions of the ext* family. </p>
 259
 260 <p> Delayed allocation improves the performance of the filesystem because
 261 the allocator has better knowledge of the eventual file size, so
 262 files are more likely to be laid out in an optimal way: data blocks
 263 sequentially, and close to the metadata blocks that describe the
 264 file. Moreover, small temporary files can be fully buffered in memory
 265 and don't cause any block allocation at all if the in-memory data
 266 structures have already been removed when writeout triggers. Delayed
 267 allocation is more effective on large memory systems because with
 268 plenty of memory available, allocations can be deferred for longer. </p>
 269
 270 EXERCISES()
 271
 272 <ul>
 273         <li> On a Linux system, run <code>cat /proc/filesystems</code> to
 274         see all file systems supported on this system. </li>
 275
 276         <li> Discuss to which extent snapshot-capable filesystems can replace
 277         off-site backups. </li>
 278
 279         <li> Run <code>mount | awk '{print $5}' | sort | uniq</code> to
 280         examine the different filesystem types which are mounted. Explain
 281         the purpose of each and determine to which of the above classes the
 282         filesystem belongs. </li>
 283
 284         <li> Understand the impact on performance of the atime mount
 285         option. Check which file systems are mounted with atime enabled. </li>
 286
 287         <li> Discuss the contents of <code>/etc/mtab</code>, and
 288         <code>/proc/filesystems</code>. Check whether <code>/etc/mtab</code>
 289         is a symbolic link. Discuss whether it should be one. </li>
 290
 291         <li> Does a read-only mount of a journaling filesystem modify the
 292         contents of the underlying block device? </li>
 293
 294 </ul>
 295
 296 HOMEWORK(«
 297
 298 Describe the various file locking types mandated by POSIX-2008.1.
 299
 300 »)
 301
 302 HOMEWORK(«
 303
 304 <ul>
 305         <li> Summarize what POSIX <code>sync(2), fsync(2)</code> and
 306         <code>fdatasync(2)</code> guarantee. </li>
 307
 308         <li> Consider the following commands: <code>touch foo; echo bar >
 309         foo; sync; mv foo bar; echo baz > foo; fsync foo</code>. Assume the
 310         system crashed at some point in between. Describe the possible states
 311         of the two files after the journal has been replayed. </li>
 312 </ul>
 313
 314 »)
 315
 316 HOMEWORK(«
 317
 318 <em>Delayed logging</em> is a feature of ext3 which was later ported
 319 to xfs. The term refers to the concept of accumulating changes in
 320 memory before writing them to the journal. Discuss the pros and cons
 321 of delayed logging.
 322
 323 »)
 324
 325 HOMEWORK(«
 326
 327 Explain what <em>reflink</em> copies are and discuss when and how to
 328 use them.
 329
 330 »)
 331
 332 SECTION(«Alternatives to Journaling»)
 333
 334 <p> Although our focus lies in topics related to journaling
 335 filesystems, we shall have a brief look at two different (but
 336 related) approaches which also guarantee consistency after a crash.
 337 Both approaches depart from the traditional way of updating data
 338 and metadata structures. To illustrate the difference, consider
 339 the situation where a line is appended to an existing text file
 340 on a traditional filesystem. The filesystem first updates the data
 341 blocks which contains the file contents. Since the file size and the
 342 modification time have changed, the inode of the file needs to be
 343 updated as well. Finally, the on-disk data structures which keep track
 344 of the used and unused blocks might also need to be updated if a new
 345 block had to be allocated for the additional line. All three updates
 346 are usually performed in-place by overwriting the existing blocks.
 347 The filesystems described in this section are different in that they
 348 avoid such in-place changes. </p>
 349
 350 SUBSECTION(«Log-structured Filesystems»)
 351
 352 <p> A log-structured filesystem only writes sequential log entries which
 353 describe the changes that have been made, essentially treating the
 354 entire space as the journal. The first log-structured filesystem was
 355 developed in the 1990s.  Popular log-structured filesystems in use
 356 today are logfs, ubifs (the <em>unsorted block image filesystem</em>),
 357 and f2fs (the <em>flash-friendly filesystem</em>). </p>
 358
 359 <p> The layout of the data and metadata blocks of a log-structured
 360 filesystem bears advantages and disadvantages. One advantage is that
 361 writes are always sequential, which is particularly good for rotating
 362 disks. However, reading a large file sequentially becomes slower
 363 because of data fragmentation. The purely sequential writes also reduce
 364 the decay of the storage media, particularly flash memory, because
 365 without in-place writes, all blocks are written about the same number
 366 of times. Other advantages of log-structured filesystem are that crash
 367 recovery is conceptionally easy and that snapshots are natural. </p>
 368
 369 <p> The main disadvantage is the overhead incurred by <em>garbage
 370 collection</em>: getting rid of old log entries that have been
 371 superseded by later ones. This overhead is particularly large if free
 372 space is short. Another disadvantage is related to inode management:
 373 since inodes are scattered throughout the disk, and the location of an
 374 inode changes whenever the file is updated, some kind of <em>inode
 375 map</em> is required to locate inodes. Updates to the inode map
 376 can be written to the log sequentially, but the current location
 377 of the inode map must be stored in a special fixed location called
 378 the <em>checkpoint region</em> so that the filesystem can recover
 379 from a crash. Since every write to the checkpoint region causes
 380 a seek, log-structured filesystems update the inode map only once
 381 in a while, for example once per minute. This process is known as
 382 <em>checkpointing</em>. </p>
 383
 384 SUBSECTION(«Copy on Write Filesystems»)
 385
 386 <p> In the context of filesystems, the term <em>copy on write</em>
 387 (CoW) means to not overwrite existing data or metadata blocks as
 388 files are modified. Instead, whenever the filesystem needs to modify
 389 a data or metadata block, it writes the modified block to different,
 390 currently unused location, leaving the contents of the original
 391 block intact. Next, the metadata blocks that need to be altered
 392 are also written to a free location without overwriting existing
 393 metadata blocks. For example, in the scenario outlined above where the
 394 <code>write(2)</code> system call extends an existing file, the three
 395 updates for data, inode and free space information are performed as
 396 writes to unused locations. Next, the filesystem switches to the new
 397 metadata to commit the change. CoW filesystems are always consistent
 398 if this last step can be performed atomically. </p>
 399
 400 <p> Two well-known open source CoW filesystem are zfs and btrfs
 401 (the <em>B-tree filesystem</em>). The first stable release of zfs for
 402 the Solaris operating system appeared in 2005, after four years of
 403 development by Sun Microsystems. Since then zfs has been ported to
 404 several other operating systems including FreeBSD and Linux. However,
 405 the zfs code is licensed under the CDDL license, which is regarded
 406 as incompatible with the GPL. For this reason, zfs is not included
 407 in the official Linux operating system kernel but is available as
 408 a third-party kernel module. Btrfs is another CoW filesystem which
 409 was designed for Linux from the start. It was merged into the Linux
 410 kernel in 2009. The feature sets of zfs and btrfs are similar, but
 411 the implementation details vary. </p>
 412
 413 <p> CoW filesystems also have disadvantages. On a system where multiple
 414 processes append data to different files simultaneously, the data
 415 blocks of each file will be fragmented, which is bad for performance.
 416 For the same reason, metadata blocks are fragmented as well, so
 417 performance suffers if the filesystem contains many files. Another
 418 disadvantage is related to the fact that it is difficult to tell
 419 how much space is going to be needed for a given CoW operation,
 420 which has caused an endless list of bugs that occur when disk space
 421 gets tight. This is why CoW filesystems should never use more than a
 422 certain ratio of the available space. For zfs the recommended limit
 423 is as low as 70%. </p>
 424
 425 EXERCISES()
 426
 427 <ul>
 428         <li> zfs and btrfs are not only filesystems: they include many other
 429         features which are traditionally performed by the volume management
 430         subsystem or by the raid driver. This includes device group handling,
 431         snapshotting, mirroring to get redundancy, striping to increase
 432         performance, and more.
 433
 434         <p> Search the web for "blatant layering violation" and glance over
 435         the (sometimes heated) discussions on whether or not it is a good
 436         idea to combine volume management, raid and filesystems. Then form
 437         your own opinion about the topic. </p> </li>
 438 </ul>
 439
 440 SECTION(«Encryption»)
 441
 442 <p> The dm-crypt device mapper target, which was covered in the <a
 443 href="LVM.html">chapter on LVM</a>, operates at the block level.
 444 It encrypts and decrypts one block at a time, regardless of whether
 445 the block is in use. This is in contrast to <em>filesystem-level
 446 encryption</em>, where encryption is performed by the filesystem
 447 on a per inode basis so that, for example, different files can
 448 be encrypted with different keys. </p>
 449
 450 <p> In this section we look at two filesystem-level encryption
 451 primitives for Linux: ecryptfs (the <em>enterprise cryptographic
 452 filesystem</em>) and fscrypt (<em>filesystem encryption</em>). The
 453 former was included into Linux in 2006 while the latter is much
 454 newer. It was originally part of the <em>flash-friendly filesystem</em>
 455 (f2fs) but has been made generic in 2015. Besides f2fs, also ext4
 456 and ubifs rely on fscrypt for encryption. </p>
 457
 458 <p> By definition, filesystem-level encryption means to encrypt the
 459 contents of regular files. In addition to that, both ecryptfs and
 460 fscrypt also encrypt file names. However, information stored in the
 461 inode is left unencrypted. Therefore, without the encryption key it is
 462 still possible to list directories as usual, but the list will contain
 463 only encrypted filenames. Moreover, file size and timestamps can be
 464 read as for unencrypted files with the standard <code>stat(2)</code>
 465 system call. </p>
 466
 467 SUBSECTION(«ecryptfs»)
 468
 469 <p> ecryptfs is a so-called <em>stacked</em> filesystem. That is,
 470 it relies on an (arbitrary) mounted filesystem as backend storage. An
 471 ecryptfs mount is similar to a bind mount in that it makes the files
 472 stored at source location (the mountpoint of the backend storage)
 473 visible at the target location. However, the source usually contains
 474 only encrypted files while files appear unencrypted at the target
 475 location. Each encrypted file is self-contained in the sense that
 476 it starts with a header which, together with the encryption key, is
 477 sufficient to decrypt the file (and the file name). Hence encrypted
 478 files can be copied between hosts, and encrypted files can be backed
 479 up without telling the backup software the encryption key. </p>
 480
 481 SUBSECTION(«fscrypt»)
 482
 483 <p> fscrypt takes a different approach. It provides encryption
 484 through a general library that can, in principle, be used by any
 485 filesystem. With fscrypt it is possible to store both encrypted and
 486 unencrypted files on the same filesystem. fscrypt has a lower memory
 487 footprint than ecryptfs since it avoids caching filesystem contents
 488 twice. Also, only half as many directory entries and inodes are
 489 needed. Another advantage of fscrypt is that the fscrypt API can be
 490 used by unprivileged users, with no need to mount a second filesystem.
 491 The major drawback of fscrypt is that <code>open(2)</code> system call
 492 fails without the key. Since backup software has to open regular files,
 493 it is not possible to backup encrypted files without the encryption
 494 key. </p>
 495
 496 EXERCISES()
 497 <ul>
 498         <li> Explain why the root directory of the filesystem cannot be
 499         encrypted with fscrypt. </li>
 500 </ul>
 501
 502 HOMEWORK(«
 503
 504 Discuss the pros and cons of filesystem level encryption vs.
 505 block level encryption.
 506
 507 »)
 508
 509 SECTION(«The Virtual Filesystem Switch (vfs)»)
 510
 511 <p> The main task of the vfs is to provide an abstraction for
 512 applications to access files in a uniform way, even if the files
 513 are stored on different filesystems. The vfs is responsible for
 514 parsing path names received from user space via system calls, and
 515 to forward the requests it can not handle itself to the specific
 516 filesystem implementation, thereby associating paths with instances
 517 of mounted filesystems. This encourages a modular filesystem design
 518 where filesystems are opaque entities which provide a certain set of
 519 methods called <em>filesystem operations</em>, for example mounting
 520 the filesystem or opening a file. The modular design helps to avoid
 521 code duplication, which is important for operating systems like Linux
 522 which support many different filesystems. </p>
 523
 524 <p> The first vfs implementation was probably shipped in the Solaris
 525 Unix System in 1985. Linux got its first vfs implementation together
 526 with the <em>extended</em> filesystem, the predecessor of ext2,
 527 ext3 and ext4. All modern operating systems have some sort of vfs,
 528 although implementation details differ. In what follows, we shall
 529 only discuss the Linux vfs. </p>
 530
 531 <p> Filesystems register themselves with the vfs at boot time or
 532 when the kernel module for the filesystem is loaded. The vfs keeps
 533 track of the available filesystem types and all mounts. To perform
 534 efficiently, the vfs maintains several data structures which describe
 535 the characteristics of the tree of files. We look at the most important
 536 data structures below but leave out the rather complicated details
 537 about how the various locking primitives (spinlocks, refcounts, RCU)
 538 are employed to deal with concurrency. </p>
 539
 540 SUBSECTION(«The Dentry Cache»)
 541
 542 <p> A <em>dentry</em> (short for "directory entry") is a data structure
 543 which represents a file or a directory. Dentries contain pointers to
 544 the corresponding inode and to the parent dentry. The vfs maintains
 545 the <em>dentry cache</em>, which is independent of the normal page
 546 cache that keeps copies of file contents in memory. Dentries are
 547 kept in hashed lists to make directory lookups fast. Dentries are
 548 also reference-counted. As long as there is a reference on a dentry,
 549 it can not be pruned from the dentry cache. Unreferenced dentries,
 550 however, can be evicted from the cache at any time due to memory
 551 pressure. Each dentry also has a "looked up" flag which enables the
 552 VFS to evict dentries which have never been looked up earlier than
 553 those which have. </p>
 554
 555 <p> On a busy system the dentry cache changes frequently. For example,
 556 file creation, removal and rename all trigger an update of the dentry
 557 cache. Clearly, some sort of coordination is needed to keep the dentry
 558 cache consistent in view of concurrent changes, like a file being
 559 deleted on one CPU and looked up on another. A global lock would scale
 560 very poorly, so a more sophisticated method called <em>RCU-walk</em> is
 561 employed. With RCU, lookups can be performed without taking locks, and
 562 read operations can proceed in parallel with concurrent writers. </p>
 563
 564 <p> The dentry cache also contains <em>negative</em> entries
 565 which represent nonexistent paths which were recently looked up
 566 unsuccessfully. When a user space program tries to access such a path
 567 again, the <code>ENOENT</code> error can be returned without involving
 568 the filesystem. Since lookups of nonexistent files happen frequently,
 569 failing such lookups quickly enhances performance. For example
 570 <code>import</code> statements from interpreted languages like Python
 571 benefit from the negative entries of the dentry cache because the
 572 requested files have to be looked up in several directories. Naturally,
 573 negative dentries do not point to any inode. </p>
 574
 575 SUBSECTION(«File and Inode Objects»)
 576
 577 <p> Positive entries in the dentry cache point to <em>inode
 578 objects</em>, which are in-memory copies of the on-disk inode
 579 structures maintained by the filesystem. Different dentry cache entries
 580 can map to the same inode object if the underlying filesystem supports
 581 hard links, but entries which refer to directories are unique. The
 582 <code>stat(2)</code> system call can be served without calling into
 583 the filesystem if the path argument of the system call corresponds
 584 to an entry of the dentry cache. </p>
 585
 586 <p> When a file is opened, the vfs allocates a <em>file object</em>
 587 (also known as <em>file description</em> and <em>struct file</em>),
 588 and adds a reference to the file object to the calling process' table
 589 of open files. The index to this table is returned as the <em>file
 590 descriptor</em> from the <code>open(2)</code> system call. The file
 591 object contains a reference to the dentry and to the filesystem
 592 specific methods for the usual operations like <code>read(2)</code>
 593 and <code>write(2)</code>. Also the file offset and the file status
 594 flags (<code>O_NONBLOCK, O_SYNC, etc.</code>) are recorded in the
 595 file object. </p>
 596
 597 <p> Like dentries, file objects are reference-counted. Once the
 598 counter hits zero, the file object is freed.  System calls like
 599 <code>dup(2)</code> and <code>fork(2)</code> increase the reference
 600 count while <code>close(2)</code> and <code>exit(2)</code> decrease
 601 it. However, not all file object references correspond to file
 602 descriptors, since also the kernel itself can hold references to a
 603 file object. For example, the loop device driver and the ecryptfs
 604 stacked filesystem increase the reference counter of the file objects
 605 they work with. Passing a file descriptor from one process to another
 606 via local sockets is another situation where the reference counters
 607 of the affected file object need to be adjusted.  </p>
 608
 609 <p> When an application calls <code>dup(2)</code> to duplicate a
 610 file descriptor, the two copies refer to the same file object and
 611 therefore share the file offset and the status flags of the file
 612 object. An <code>open(2)</code> call, however, creates a new file
 613 object even if the file is already open. </p>
 614
 615 SUBSECTION(«vfs Mounts and vfs Superblocks»)
 616
 617 <p> Another job of the vfs is to keep track of the tree of all mounts
 618 and their properties.  Do do so, the vfs maintains a tree of <em>mount
 619 structures</em>.  Whenever a filesystem is mounted, one such structure
 620 is allocated and linked into the tree.  Among other information,
 621 the mount structure contains various pointers, including pointers to </p>
 622
 623 <ul>
 624         <li> the mount structure of the parent mount, </li>
 625
 626         <li> the dentry that corresponds to the mountpoint (the root of the
 627         mounted filesystem), </li>
 628
 629         <li> the superblock of the mounted filesystem, </li>
 630
 631         <li> the containing mount namespace. </li>
 632 </ul>
 633
 634 Other information stored in the mount structure:
 635
 636 <ul>
 637         <li> the list of child mounts, </li>
 638
 639         <li> the mount propagation rules, </li>
 640
 641         <li> the mount flags (<code>MS_RDONLY, MS_NOEXEC, MS_NOATIME</code>,
 642         etc.). </li>
 643 </ul>
 644
 645 <p> Since a filesystem can be bind-mounted, there can be several mount
 646 structures whose superblock pointers point to the same superblock.
 647 The superblock structure contains the UUID, quota information,
 648 granularity of timestamps, and much more. </p>
 649
 650 EXERCISES()
 651
 652 <ul>
 653         <li> Run a command like <code>strace -e%file ls</code> to see the
 654         (negative) dentry cache in action. </li>
 655
 656         <li> Guess which system calls the <code>df(1)</code> command
 657         needs to perform to do its work. Confirm by running <code>strace
 658         df</code>. </li>
 659
 660         <li> The <code>mountpoint(1)</code> command determines if a given
 661         directory is the mountpoint of a filesystem. Explain how the command
 662         is implemented. </li>
 663
 664         <li> Assume a process opens a file by calling <code>open(2)</code>,
 665         then forks so that the child process inherits the file descriptor.
 666         Assume further that the child calls <code>lseek(2)</code> to reposition
 667         the file offset. Will the parent process see the modified offset? </li>
 668
 669         <li> Is it safe to run <code>fsck(8)</code> on a filesystem which has
 670         been lazily unmounted (by executing <code>mount -l</code>)? </li>
 671
 672         <li> What is the purpose of the UUID of a filesystem? Why is it
 673         generally a good idea to put the UUID instead of the device name as
 674         the first field in <code>/etc/fstab</code>? Figure out how to print
 675         the UUID of an ext4 and an xfs filesystem. </li>
 676
 677         <li> When a block device containing a filesystem is cloned with
 678         <code>dd(8)</code>, the two UUIDs match. The same is true if the block
 679         device was snapshotted with LVM (regardless of whether thin or regular
 680         snapshots were used). Discuss the consequences of this fact. </li>
 681
 682         <li> For a network filesystem like nfs, a dentry can become invalid if
 683         the corresponding file was modified on the server or by another
 684         client. Discuss the implications of this fact with respect to the
 685         dentry cache. </li>
 686
 687         <li> The <code>umount -a</code> command unmounts all filesystems which
 688         are not busy. Describe an algorithm which walks the mount tree to
 689         implement the command, taking into account that a mount is busy if
 690         it has at least one child mount.  </li>
 691 </ul>
 692
 693 HOMEWORK(«
 694
 695 Describe the concept of a file change notification system and discuss
 696 possible use cases. The Linux kernel contains three different file
 697 change notification APIs: dnotify, inotify and fsnotify. Explain the
 698 difference and describe in one paragraph how applications make use
 699 of certain file descriptors to communicate with the kernel in order
 700 to track file change events.
 701
 702 »)
 703
 704 SECTION(«ext, ext2, ext3, ext4»)
 705
 706 <p> When the first versions of Linux were released in 1991, the kernel
 707 relied on the <em>minix</em> filesystem whose source code could
 708 be used freely in education. However, this filesystem was designed
 709 for education rather than for real use and had severe limitations.
 710 For example it supported only file names up to 14 characters long,
 711 and a total size of 64M. </p>
 712
 713 <p> In 1992 the <em>extended filesystem</em> (ext) was released as the
 714 first filesystem that was created specifically for Linux. It already
 715 made use of the VFS API that was introduced at the same time. In 1993,
 716 ext was superseded by ext2, and later by ext3 (2001) and ext4 (2008).
 717 While ext2 is a separate filesystem, the ext3 and ext4 filesystems
 718 share the same implementation. </p>
 719
 720 <p> Over time, many new features were added to ext3 and later to ext4.
 721 For example, journaling was one of the main features that ext3 added
 722 on top of ext2 while delayed allocation and on-disk data structure
 723 checksumming came with ext4. In the remainder of this section we look
 724 at the way the ext* filesystems lay out their data and metadata blocks
 725 and describe some of the features of ext3 and ext4. </p>
 726
 727 SUBSECTION(«Block Layout»)
 728
 729 <p> All ext* filesystems lay out the space of the underlying block
 730 device in a traditional way, inspired by the original Unix
 731 filesystem, ufs. The available blocks are partitioned into <em>block
 732 groups</em>. Each block group is typically 128M large and always
 733 contains its own set of tables and bitmaps that manage the blocks of
 734 the group. If feasible, the data blocks of each file are kept in the
 735 same block group as its inode and the containing directory to avoid
 736 unnecessary seeks. As of ext4, block groups can be combined to form
 737 larger, so-called <em>flexible</em> block groups to improve metadata
 738 locality and to have large files laid out sequentially. </p>
 739
 740 <p> Inodes are referenced in the <em>inode table</em>, and a bitmap keeps
 741 track of allocated and unallocated inodes. A copy of the filesystem
 742 superblock is stored in several other block groups since the superblock
 743 is critical for the integrity of the filesystem. </p>
 744
 745 <p> The directory entries of an ext or ext2 filesystem are laid out
 746 in the traditional way. That is, the directory names and associated
 747 information like file type and the inode number are listed in the data
 748 blocks that correspond to the directory. Directories can be imagined
 749 as a 3-column table where each line contains an inode number, a file
 750 type number, and the path component. Since searching a linear array
 751 performs poorly, ext3 implemented B-trees to speed up name lookups
 752 in large directories. The nodes of the tree are keyed by the hashes
 753 of the names of the directory entries. </p>
 754
 755 SUBSECTION(«Journaling Modes»)
 756
 757 <p> If journaling is enabled, metadata updates are always journaled
 758 (i.e., a log record is written to the journal first). However, this
 759 is not always the case for data blocks. The <code>data</code> mount
 760 option for ext3 and ext4 specifies how writeout of data blocks works.
 761 The following three journaling modes offer different trade-offs
 762 between speed and data integrity. </p>
 763
 764 <dl>
 765         <dt> data=journal </dt>
 766
 767         <dd> This is the slowest, but also the safest journaling mode. In
 768         this mode all data blocks are journaled just like the metadata
 769         blocks. Since all data blocks are written twice, this journaling mode
 770         has a substantial negative impact on performance. </dd>
 771
 772         <dt> data=ordered </dt>
 773
 774         <dd> This is the default value, which offers a good trade-off between
 775         speed and data integrity. Ordered means that metadata blocks are only
 776         updated after the corresponding data blocks have been written out. In
 777         other words, data is written directly to the filesystem (rather than
 778         the journal as with <code>data=journal</code>), but the update of the
 779         corresponding metadata blocks is deferred until all data blocks have
 780         been written. </dd>
 781
 782         <dt> data=writeback </dt>
 783
 784         <dd> This is the fastest mode of operation. No ordering between data
 785         and metadata blocks is enforced. Filesystem integrity is guaranteed
 786         because metadata is still journaled. However, after an unclean
 787         shutdown and the subsequent replay of the journal, files which were
 788         under writeback at the time of the crash may contain stale data. This
 789         happens if the metadata blocks have been updated to report the larger
 790         size but the corresponding data did not make it to the disk before
 791         the crash. </dd>
 792 </dl>
 793
 794 SUBSECTION(«Extents»)
 795
 796 <p> The ext2 and ext3 filesystems employ the traditional indirect block
 797 scheme, which is basically a table of block numbers which map to the
 798 blocks that comprise the contents of the file. One feature of ext4
 799 are <em>extent trees</em>, which are a more efficient data structure
 800 to describe this mapping because file contents are often laid out
 801 sequentially on the underlying block device. In this case, if the file
 802 is large it saves metadata space if the block numbers are not stored
 803 as a long list but as <em>extents</em>, that is, ranges of successive
 804 blocks. Extents have to be organized in a tree to quickly map a given
 805 file offset to the block number that contains the file contents at this
 806 offset. The first few extents of the extent tree, including its root,
 807 are stored in the inode itself. Only files which need more extents
 808 require extra metadata blocks for the nodes of the extent tree. </p>
 809
 810 SUBSECTION(«Growing and Shrinking»)
 811
 812 <p> Filesystems often need to be grown, but sometimes it is also handy
 813 to shrink a filesystem, for example to redistribute the storage between
 814 the logical volumes of a volume group of fixed size. A feature which
 815 distinguishes ext* from xfs is that ext2, ext3 and ext4 filesystems can
 816 be shrunk while xfs filesystems can only be grown. However, to shrink
 817 an ext* filesystem, the filesystem must be offline. That's in contrast
 818 to online growing, which is supported for both ext* and xfs. </p>
 819
 820 <p> To shrink a filesystem which is stored on a logical volume,
 821 one needs to convey the new size to both the device mapper and the
 822 filesystem. It is usually a good idea to run <code>fsck(8)</code> after
 823 the filesystem has been unmounted and before <code>resize2fs</code>
 824 is run to shrink it. After the filesystem has been shrunk, the next
 825 step is to shrink also the underlying LV. Of course the new size of
 826 the LV must not be smaller than the size of the shrunk filesystem. To
 827 avoid this from happening, for example due to rounding errors,
 828 it's best to make the LV slightly larger than necessary and then
 829 enlarge the filesystem to the maximal possible size by running
 830 <code>resize2fs(8)</code> without specifying the new size. </p>
 831
 832 SUBSECTION(«Fast Commits»)
 833
 834 <p>
 835
 836 Optimization introduced in 2020 (Linux-5.10).
 837
 838 activated at mkfs time
 839
 840 light-weight journaling method
 841
 842 Idea: parts of the metadata written to the log can instead be derived
 843 from the inode. Leads to more compact format.
 844
 845 Implementation: additional journal for fast commits, i.e., for those
 846 operations that can be optimized. Contains changes at the *file* level
 847
 848 <li> Check <code>/proc/fs/ext4/dev/fc_info</code> to see whether fast
 849 commits are supported in the runing kernel. </li>
 850
 851 </p>
 852
 853 EXERCISES()
 854
 855 <ul>
 856         <li> Using default options, create an ext2, ext3 and ext4 filesystem.
 857         Compare the three superblocks by examining the output of <code>dump2fs
 858         -h &lt;device&gt;</code>. </li>
 859
 860         <li> Examine the sysfs interface of ext4 which is available through
 861         the files below <code>/sys/fs/ext4</code>.
 862
 863         <li> Consider an application which aims to replace an existing
 864         file with the rewrite-and-replace method indicated by the code
 865         <a href="«#»broken_rename">below</a> (error checking omitted).
 866         Describe the log records which each of the four system calls generates.
 867         Explain why the approach is inherently buggy and may result in an
 868         empty file <code>foo</code> even if the filesystem is mounted with
 869         <code>data=ordered</code> (the default journaling mode).  How does
 870         the <code>auto_da_alloc</code> mount option of ext4 try to mitigate
 871         the chance of data loss? </li>
 872
 873         <li> Why is data journaling incompatible with encryption? Hint: Commit
 874         73b92a2a5e97 in the linux kernel repository. </li>
 875
 876         <li> Describe a scenario where ext2 is more suitable than ext3 or
 877         ext4. </li>
 878 </ul>
 879
 880 HOMEWORK(«
 881 <ul>
 882         <li> Provide a step-by step procedure to set up an encrypted ext4
 883         filesystem. </li>
 884
 885         <li> Explain how the <em>MMP</em> feature of ext4 helps to protect
 886         the filesystem from being multiply mounted. </li>
 887
 888 </ul>
 889 »)
 890
 891 SECTION(«The Extents Filesystem (xfs)»)
 892
 893 xfs is a journaling filesystem which was implemented in 1993 for the
 894 IRIX operating system and was ported to Linux in 2001. While IRIX was
 895 discontinued in 2005, the Linux port of xfs is actively maintained
 896 and new features and improvements are added regularly.
 897
 898 To optimize for performance, xfs departs from the traditional approach
 899 that is followed by the ext* family. From the beginning xfs was
 900 designed for running on high-end servers where plenty of resources
 901 are available to max out even the largest and fastest storage systems,
 902 and to perform well under high load when multiple threads access
 903 the filesystem concurrently. The implementation relies on B-trees for
 904 data, metadata and free space management. xfs is not a COW filesystem,
 905 though, and does not do any form of block device management. Tasks like
 906 encryption, raid or volume management are left to the corresponding
 907 filesystem-agnostic block layer interfaces, for example MD (Software
 908 Raid) and LVM (the logical volume manager).
 909
 910 Unlike the rather static layout of the ext* filesystems, metadata on
 911 xfs is dynamically allocated. Consequently, the metadata structures
 912 have to be discovered at mount time by walking the filesystem
 913 structure. Metadata blocks are self-describing in that each metadata
 914 object contains a unique identifier, <em>the log sequence number</em>,
 915 which plays the role of a timestamp. CRC32c checksums are used to
 916 detect corruption: when the block is read, the CRC32c value is
 917 recomputed and checked to verify to integrity of the object.
 918
 919 SUBSECTION(«Allocation groups»)
 920
 921 Like the block groups of the ext* family, an xfs filesystem is
 922 divided into several "mini filesystems" called <em>Allocation
 923 Groups</em> (AGs). This allows xfs to handle operations in parallel,
 924 which improves performance if many unrelated processes access the
 925 filesystem simultaneously. New directories, are always placed in a
 926 different AG than its parent and the inodes of the files in the new
 927 directory are clustered around the directory if possible.
 928
 929 AGs can be up to 1T large, which is much larger than the block groups
 930 of the ext* family, but still small enough to allow relative AG
 931 pointers to be 32 bits wide rather than 64. Each AG maintains its own
 932 superblock and its own set of B-trees for resource management. Files
 933 and directories are allowed to span multiple AGs, so the AG size does
 934 not limit the maximal file size.
 935
 936 The first AG is the <em>primary</em> AG. Its superblock is special
 937 in that it stores the accumulated counters of all AGs. The secondary
 938 superblocks are only consulted by <code>xfs_repair(8)</code>.
 939
 940 SUBSECTION(«Project Quota Implementation»)
 941
 942 <p> Project quotas used to be an xfs feature, but the functionality has
 943 been made generic and is therefore available to other filesystems as
 944 well. Besides xfs, also ext4 supports project quotas. </p>
 945
 946 <p> To limit the size of an arbitrary subtree, a special inode flag,
 947 <code>XFS_DIFLAG_PROJINHERIT</code>, is used. This flag indicates
 948 that the directory and all inodes created in the directory inherit
 949 the project ID of the directory. Hence the act of creating a file
 950 in a <code>XFS_DIFLAG_PROJINHERIT</code> marked directory associates
 951 the new file with s a specific project ID. New directories also get
 952 marked with <code>XFS_DIFLAG_PROJINHERIT</code> so the behaviour is
 953 propagated down the directory tree. </p>
 954
 955 <p> Project quota is accounted for when moving <em>into</em> an accounted
 956 directory tree, but not when moving out of a directory tree into
 957 an unaccounted location. Moreover, one can create hard links to an
 958 accounted file in an uncontrolled destination (as the inode is still
 959 accounted). But it is not allowed to link from an accounted directory
 960 into a destination with a different project ID. </p>
 961
 962 <p> Project IDs may be mapped to names through the
 963 <code>/etc/projid</code> and <code>/etc/projects</code> configuration
 964 files.  </p>
 965
 966 SUBSECTION(«Speculative Preallocation»)
 967
 968 <p> As files are being written, xfs allocates extra blocks beyond the
 969 current end of file, anticipating that further writes will arrive to
 970 extend the file. The preallocation size is dynamic and depends mainly
 971 on the size of the file. When the file is closed, the unneeded extra
 972 blocks are reclaimed. </p>
 973
 974 <p> The speculatively preallocated post-EOF blocks help to minimize
 975 file fragmentation, but they can cause confusion because they are
 976 accounted identically to other blocks, making files appear to use
 977 more data blocks than expected. </p>
 978
 979 <p> If the system crashes while preallocated post-EOF blocks exist,
 980 the space will be recovered the next time the affected file gets closed
 981 (after it has been opened of course) by the normal reclaim mechanism
 982 which happens when a file is being closed. </p>
 983
 984 SUBSECTION(«Reverse Mapping»)
 985
 986 <p> This feature was implemented in 2018. It adds yet another B-tree to
 987 the xfs on-disk data structures: the <em> reverse mapping tree</em>,
 988 which allows the filesystem to look up the owner of a given block
 989 (if any). For example, if the underlying storage device reports that
 990 a certain block went bad, and that block happens to contain contents
 991 of a regular file, the reverse mapping tree yields the corresponding
 992 inode and the file offset. </p>
 993
 994 <p> Another use of the reverse mapping tree is <em>filesystem
 995 scrubbing</em>, a data integrity technique where a kernel thread
 996 runs in the background to check the on-disk data structures for
 997 consistency while the filesystem is mounted. </p>
 998
 999 <p> Since reverse mapping imposes some performance overhead, the
1000 feature is disabled by default. </p>
1001
1002 SUBSECTION(«mkfs Options»)
1003
1004 Generally, the default settings are suitable for most workloads,
1005 so there is usually no need for manual optimization. Nevertheless,
1006 many xfs parameters can be tweaked at filesystem creation time. The
1007 following list describes some options to <code>mkfs(8)</code>.
1008
1009 <ul>
1010         <li> AG count. The optimal number of AGs for a given block device
1011         depends on the underlying storage. Since a single device has
1012         limited IO concurrency capability while raid devices allow for
1013         much more concurrency, the number of AGs for the single device
1014         should be smaller. <code>mkfs.xfs(8)</code> tries to figure out the
1015         characteristics of the block device and pick a suitable number of
1016         AGs. However, in some cases, for example if the filesystem is stored
1017         on a hardware raid array or on a bcache device, it can not detect the
1018         geometry of the underlying storage. Explicitly specifying the number
1019         of AGs is probably a good idea in this case. As a rule of thumb,
1020         32 AGs should be enough for most cases.
1021
1022         <p> The number of AGs can not be changed after the filesystem has been
1023         created. However, when an xfs filesystem is grown, new AGs are added,
1024         and care should be taken to align the device size to a multiple of
1025         the AG size. Generally, one should not increase the filesystem many
1026         times in small steps. </p> </li>
1027
1028         <li> Stripe unit and stripe width. <code>mkfs.xfs(8)</code> tries
1029         hard to choose reasonable defaults, but for hardware raid arrays,
1030         the command has no way to tell the raid level and the number of
1031         data disks in the array. Without this number it is beneficial to
1032         specify the <code>sunit</code> and <code>swidth</code> options to
1033         <code>mkfs.xfs(8)</code>. </li>
1034 </ul>
1035
1036 EXERCISES()
1037
1038 <ul>
1039         <li> Run <code>df -h $P</code> and <code>df -hi $P</code>, where
1040         <code>P</code> is a path on which project quotas are enforced. </li>
1041
1042         <li> Create two directories and set up project quotas for both, using
1043         different project IDs. Guess what happens if one tries to hard link
1044         two files between the two directories. Verify by running a suitable
1045         <code>ln</code> command. </li>
1046
1047         <li> Run <code>xfs_bmap -v file</code> for a large file on an xfs
1048         filesystem to see the extents of the file. </li>
1049
1050         <li> Run <code>xfs_logprint -t</code> to dump the log records of a
1051         busy xfs filesystem. </li>
1052
1053         <li> Run <code>xfs_info(8)</code> on an xfs mountpoint and examine
1054         the values shown in the data section.</li>
1055
1056         <li> Which of the three journaling modes of ext4 (if any) corresponds
1057         to the journaling mode of xfs? </li>
1058
1059         <li> Compile the xfsprogs and xfstests packages. Run xfstests on an
1060         empty xfs filesystem. </li>
1061
1062         <li> Run <code>xfs_info(8)</code> on an existing xfs filesystem and
1063         determine whether the device size is an integer multiple of the AG
1064         size. Discuss the relevance of this property. </li>
1065
1066         <li> Assume you'd like to create a ~100T large xfs filesystem on
1067         a logical volume so that the device size is an integer multiple
1068         of the AG size. Come up with suitable <code>lvcreate</code> and
1069         <code>mkfs.xfs(8)</code> commands to achieve this. </li>
1070
1071         <li> Given a path to a file on an xfs filesystem that is mounted with
1072         project quotas enabled, how can one determine the project ID of the
1073         file? </li>
1074 </ul>
1075
1076 HOMEWORK(«
1077
1078 Summarize how the reflink feature is implemented in xfs.
1079
1080 »)
1081
1082 HOMEWORK(«
1083
1084 Explain how xfs metadumps work and which parts of the filesystem are
1085 included in the dump. Provide a formula to estimate the expected
1086 size of the dump, given the outputs of <code>xfs_info(8)</code>
1087 and <code>df(1)</code>.
1088
1089 »)
1090
1091 SECTION(«The Network Filesystem (nfs)»)
1092
1093 The nfs service allows computers to mount a directory located on
1094 a remote server as if it were a local disk, allowing file sharing
1095 over a (typically local) network. On Linux, both the server and the
1096 client are part of the operating system kernel, but there are also
1097 nfs implementations which operate in user space. The nfs protocol
1098 is an open specification, available as a set of RFCs, that has been
1099 implemented on various operating systems including Linux, FreeBSD
1100 and MacOS. Server and client can run different operating systems as
1101 long as they both support a common nfs protocol version.
1102
1103 The original nfs protocol was designed in 1984 by Sun Microsystems for
1104 the Sun operating system (SunOS). This version was never released and
1105 was only deployed inside the company. Protocol version 2 was released
1106 in 1989. It is long obsolete due to its severe limitations, for example
1107 it had a maximal file size of 2G. Its successor, protocol version 3,
1108 was released in 1995 and fixed most of the limitations. This version
1109 is still in use, although nfs protocol version 4 (called nfs4 in what
1110 follows) is most frequently deployed these days. It was released
1111 in 2000 and contains several performance, robustness and security
1112 improvements over the older versions. The authorative resource for
1113 the gory details of nfs4 is RFC 7530. The nfs protocol is still under
1114 active development. Protocol version 4.1 was released in 2010 and
1115 version 4.2 followed in 2016.
1116
1117 SUBSECTION(«rpc and xdr»)
1118
1119 <p> The nfs protocols are built on top of a concept called <em>remote
1120 procedure call</em> (rpc), which is based on an encoding format known
1121 as <em>external data representation</em> (xdr). The rpcs which are
1122 provided by the nfs server are closely related to filesystem-specific
1123 system calls like <code>read(2), write(2), link(2), rename(2),
1124 mkdir(2)</code> etc. Therefore an introduction to nfs naturally starts
1125 with rpc and xdr. </p>
1126
1127 <p> The functionality of a network service can often be divided into
1128 the low-level part and the application-level part.  The low-level
1129 part talks to the kernel to establish the connection and to send
1130 and receive data, using system calls like <code>socket(2), bind(2),
1131 listen(2), connect(2), recv(2), send(2)</code>, etc. This part is
1132 independent of the application layer which is only concerned with
1133 the network protocol of the service. For a service like nfs which
1134 combines more than one network protocol, it makes sense to abstract
1135 out the common low-level part. The rpc framework was designed in 1976
1136 to provide such an abstraction.  It supports a variety of transports
1137 including tcp and udp. With rpc, a program running on one computer can
1138 execute a function on a different computer. The functions that can
1139 be called in this manner, the <em>rpc services</em>, are identified
1140 by a program number and the version number. Originally developed by
1141 Sun Microsystems in the 1980s, rpc is still in use today, sometimes
1142 still under the old "sunrpc" name. </p>
1143
1144 <p> In general, the called procedure runs on a different system as the
1145 calling procedure, so the client and server processes don't share the
1146 same address space, and no memory references can be passed. Instead,
1147 data structures must be <em>serialized</em> first, i.e. converted to a
1148 certain transfer format that can be stored in a single memory buffer,
1149 the xdr buffer, which is then sent to the server.  The received xdr
1150 buffer is <em>de-serialized</em> (decoded) by the server, possibly
1151 in a different way. For example, the server might store the bytes
1152 which describe an integer value in a different order than the client
1153 to meet the requirements of its CPU (little/big endian). The xdr API
1154 offers routines to convert many predefined data types (int, string,
1155 etc.) to an xdr buffer or vice versa. This unburdens the programmer
1156 from such details as much as possible. </p>
1157
1158 <p> To activate rpc on a system, the <code>rpcbind(8)</code> daemon
1159 (formerly known as portmapper) must be running.  This daemon manages
1160 the various procedures employed by nfs such as mount, lock manager,
1161 quota daemon, and the nfs procedure itself. It communicates with
1162 rpc clients by listening on a well-known port. Clients first send a
1163 <code>get_port</code> request to <code>rpcbind(8)</code> in order to
1164 find out the port number which corresponds to the procedure they are
1165 interested in.  For example, an nfs client which intends to mount an
1166 nfs-exported directory requests the port number of the mount procedure
1167 from <code>rpcbind(8)</code>. A second request is then made to actually
1168 mount the filesystem. The exercises of this section ask the reader to
1169 run the <code>rpcinfo(8)</code> tool to show the available procedures
1170 and their port numbers on the specified server. </p>
1171
1172 <p> The input format for rpc is the <em>rpc language</em> (rpcl),
1173 which is similar to C. This format fully describes the application
1174 protocol, including all procedures and data types. RFC 7531 contains
1175 the full xdr description of nfs4 in rpcl. The <code>rpcgen(1)</code>
1176 protocol compiler generates C code from rpcl input. The C code is
1177 then compiled to generate application code which implements the
1178 protocol described by the input. <code>rpcgen(8)</code> offers
1179 multiple application interfaces which provide different degrees of
1180 control over the rpc internals. </p>
1181
1182 SUBSECTION(«Stateless and Stateful Protocols»)
1183
1184 <p> The nfs protocol versions 2 and 3 are <em>stateless</em>, which
1185 means that that by design the server does not keep track of what
1186 clients do. For example, the server does not remember which files
1187 are currently open. Instead, the client tracks open files and the
1188 current offset of each open file, translating application requests
1189 into suitable protocol messages. While statelessness simplifies crash
1190 recovery for both the client and the server, it also has downsides. For
1191 example, file locking requires the server to maintain the existing
1192 locks and the list of clients which are holding them.  Since this
1193 can not be done with a stateless protocol, another rpc service,
1194 the <em>lock daemon</em> (lockd), was added. To recover the state of
1195 locks after a server reboot, yet another rpc service, the <em>status
1196 daemon</em> (statd), had to be introduced. This design added complexity
1197 for no real gain, which is why nfs4 departed from the previous versions
1198 by introducing state. With a stateful protocol it became possible to
1199 combine all related rpc services into a single service which uses a
1200 single TCP port. This simplifies the implementation and also allows
1201 for <em>compound operations</em> where the client sends more than
1202 one request in a singe rpc call. </p>
1203
1204 SUBSECTION(«Identifiers and File Handles»)
1205
1206 <p> File handles describe the file or directory a particular operation
1207 is going to operate upon. For the nfs clients, file handles are
1208 opaque blobs that can only be tested for equality, but which can not
1209 be interpreted in any way. However, for the nfs server a file handle
1210 identifies the corresponding file or directory. Most protocol requests
1211 include a file handle. For example, the LOOKUP and MKDIR operations
1212 both return a file handle to the nfs client. </p>
1213
1214 <p> A file handle consists of three identifiers: a filesystem ID,
1215 an inode number and the so-called <em>generation number</em>. The
1216 filesystem ID and the inode number also exist for local files. They
1217 are derived from the <code>statvfs</code> structure that describes
1218 the exported filesystem and the <code>stat</code> structure of the
1219 inode, respectively. The generation number, however, is only needed
1220 for network file systems. Roughly speaking, the generation number
1221 counts how many times the inode has been re-used. This is necessary to
1222 prevent clients from accessing a file through an existing file handle
1223 if the file was deleted on the server and its inode number has been
1224 re-used subsequently. File handles are based on <em>leases</em>:
1225 The client periodically talks to the server to update its leases. </p>
1226
1227 <p> There is a deep interaction between file handles and the dentry
1228 cache of the vfs. Without nfs, a filesystem can rely on the following
1229 "closure" property: For any positive dentry, all its parent directories
1230 are also positive dentries. This is no longer true if a filesystem
1231 is exported. Therefore the filesystem maps any file handles sent to
1232 nfs clients to <em>disconnected</em> dentries. Any process whose cwd
1233 is on a local fs contributes to the reference counter of the dentry
1234 that corresponds to the directory, and thus prevents the filesystem
1235 from being unmounted. For nfs, this is not possible.  More general:
1236 remote applications need a way to refer to a particular dentry,
1237 stable across renames, truncates, and server-reboot. </p>
1238
1239 SUBSECTION(«Attribute Caching»)
1240
1241 <p> Several rpcs return file <em>attributes</em>, i.e., the inode
1242 information which is available for local filesystems through the
1243 <code>stat(2)</code> system call. For example, the <code>LOOKUP</code>
1244 rpc returns a new file handle and the attributes that are associated
1245 with the given path, and the <code>GETATTR</code> rpc returns the
1246 current attributes of the file which corresponds to an existing
1247 file handle.  By default, nfs clients cache these metadata. However,
1248 since metadata can change at any time due to file operations from other
1249 clients, the cached information can become stale.  Therefore attributes
1250 are cached for only a few seconds, which is thus the duration of the
1251 time window during which metadata modifications caused on a different
1252 nfs client remain undetected. Reducing this time window can result in
1253 flooding the server with <code>GETATTR</code> requests while extending
1254 it increases the chance of returning stale cached data or metadata
1255 to the application.  With the <code>noac</code> mount option, the
1256 client asks the server every time it needs to assess file metadata.
1257 However, the option also prohibits <em>data</em> caching, just like
1258 the <code>sync</code> option. This severely impacts performance. </p>
1259
1260 <p> Changes to directories are handled similarly. To  detect when
1261 directory entries have been added or removed on the server, the
1262 client watches the directory mtime (nfsv2 and nfsv3) or <em>change
1263 attribute</em> (nfsv4). When the client detects a change, it drops
1264 all cached attributes for that directory. Since the directory's mtime
1265 and the change attributes are cached attributes, it  may  take some
1266 time before a client notices directory changes. </p>
1267
1268 SUBSECTION(«Data Caching and Cache Consistency»)
1269
1270 <p> nfs clients are usually allowed to cache write operations
1271 because the write caches increase client performance significantly
1272 and reduce the load of the server at the same time, allowing it to
1273 support more clients. However, one side effect of write caching is
1274 that other clients which access the same file at the same time will
1275 not see the changes immediately. The <em>consistency guarantees</em>
1276 of a network file system describe the semantics of such concurrent
1277 file operations. </p>
1278
1279 define(«caco_height», «300»)
1280 define(«caco_width», «100»)
1281 define(«caco_margin», «10»)
1282 dnl: args: y-pos, client-no, text
1283 define(«caco_text», «
1284         <text
1285                 x="12"
1286                 y="eval($1 * eval((caco_height() - caco_margin()) / 7)
1287                         + 2 * caco_margin())"
1288                 ifelse(«$2», «1», «stroke="#228" fill="#228"»)
1289                 ifelse(«$2», «2», «stroke="#822" fill="#822"»)
1290                 ifelse(«$2», «3», «stroke="#282" fill="#282"»)
1291                 font-size="20"
1292         >$3</text>
1293 »)
1294 <div>
1295 <svg
1296         width="caco_width()"
1297         height="caco_height()"
1298         xmlns="http://www.w3.org/2000/svg"
1299         xmlns:xlink="http://www.w3.org/1999/xlink"
1300 >
1301         <path
1302                 stroke-width="3"
1303                 stroke="black"
1304                 d="
1305                         M 5 1
1306                         l 0 eval(caco_height() - caco_margin())
1307                         l 3 0
1308                         l -3 5
1309                         l -3 -5
1310                         l 3 0
1311                 "
1312         />
1313         caco_text(«0», «1», «open»)
1314         caco_text(«1», «1», «write»)
1315         caco_text(«2», «2», «open»)
1316         caco_text(«3», «1», «close»)
1317         caco_text(«4», «3», «open»)
1318         caco_text(«5», «2», «read»)
1319         caco_text(«6», «3», «read»)
1320 </svg>
1321 </div>
1322
1323 <p> The nfs versions 2 and 3 provide <em>weak cache consistency</em>
1324 which notifies clients about changes made by other clients before
1325 and after an rpc. This concept turned out to be problematic, so
1326 nfsv4 replaced weak cache consistency by <em>close-to-open cache
1327 consistency</em>, which means that an nfs client is only guaranteed
1328 to see the effects of another client's write operation if it opens
1329 the file <em>after</em> the client that wrote to the file has closed
1330 it. </p>
1331
1332 <p> To illustrate close-to-open cache consistency, consider the
1333 scenario illustrated in the diagram on the left where three nfs clients
1334 (as indicated by colors) access the same file. The blue client opens
1335 the file and writes to it while the other two clients only perform read
1336 operations. With close-to-open cache consistency the green client is
1337 guaranteed to see the write operation of the blue client while there
1338 is no such guarantee for the red client. </p>
1339
1340 SUBSECTION(«File and Directory Delegations»)
1341
1342 <p> nfs4 introduced a per-file state management feature called
1343 <em>file delegation</em>. Once a file has been delegated to a client,
1344 the server blocks write access to the file for other nfs clients and
1345 for local processes. Therefore the client may assume that the file
1346 does not change unexpectedly. This cache-coherency guarantee can
1347 improve performance because the client may cache all write operations
1348 for this file, and only contact the server when memory pressure forces
1349 the client to free memory by writing back file contents. </p>
1350
1351 <p> A drawback of file delegations is that they delay conflicting open
1352 requests by other clients because existing delegations must be recalled
1353 before the open request completes. This is particularly important if
1354 an nfs client which is holding a delegation gets disconnected from
1355 the network. To detect this condition, clients report to the server
1356 that they are still alive by periodically sending a <code>RENEW</code>
1357 request. If no such request has arrived for the <em>lease time</em>
1358 (typically 90 seconds), the server may recall any delegations it
1359 has granted to the disconnected client. This allows accesses from
1360 other clients that would normally be prevented because of the
1361 delegation. </p>
1362
1363 <p> However, the server is not obliged to recall <em>uncontested</em>
1364 delegations for clients whose lease period has expired. In fact,
1365 newer Linux NFS server implementations retain the uncontested
1366 state of unresponsive clients for up to 24 hours. This so-called
1367 <em>courteous server</em> feature was introduced in Linux-5.19
1368 (released in 2022). </p>
1369
1370 <p> Let us finally remark that the delegations as discussed above
1371 work only for regular files. NFS versions up to and including 4.0
1372 do not grant delegations for directories. With nfs4.1 an nfs client
1373 may ask the server to be notified whenever changes are made to the
1374 directory by another client. Among other benefits, this feature allows
1375 for <em>strong directory cache coherency</em>. However, as of 2022,
1376 directory delegations are not yet implemented by Linux. </p>
1377
1378 SUBSECTION(«Silly Renames and Stale File Handles»)
1379
1380 <p> Many applications employ the following old trick to store temporary
1381 data without leaving a stale temp file behind in case the process
1382 crashes or is killed with <code>SIGKILL</code>. They create and open
1383 a temporary file, then call <code>unlink(2)</code> to disassociate
1384 the path from the filesystem tree while retaining the file descriptor
1385 for subsequent I/O operations. </p>
1386
1387 <p> With NFS this does not work because the file descriptor exists
1388 only on the client, and the server doesn't know about it. Consequently
1389 the normal <code>unlink(2)</code> call on the server would delete
1390 the file and free its data blocks.  This is why the nfs client just
1391 <em>renames</em> the file to something like <code>.nfs12345</code>
1392 if an application calls <code>unlink(2)</code> to remove it while it
1393 is still open. Only after all the last file descriptor that refers
1394 to the thusly silly-renamed file is closed, the client removes the
1395 file by issuing an appropriate rpc. </p>
1396
1397 <p> This approach is not perfect. For one, if the client crashes, a
1398 stale <code>.nfs12345</code> file remains on the server. Second, since
1399 silly renames are only known to the nfs client, bad things happen if a
1400 different client removes the file. Finally, if an application running
1401 on a client removes the last regular file in a directory, and this
1402 file got silly-renamed because it was still held open, a subsequent
1403 <code>rmdir</code> will fail unexpectedly with <code>Directory not
1404 empty</code>. Version 4.1 of the NFS protocol finally got rid of
1405 silly renames: An NFS4.1 server knows when it its safe to unlink a
1406 file and communicates this information to the client. </p>
1407
1408 <p> The file handle which an nfs client received through some earlier
1409 rpc can become invalid at any time due to operations on different
1410 hosts. This happens, for example, if the file was deleted on the server
1411 or on a different nfs client, or when the directory that contains
1412 the file is no longer exported by the server due to a configuration
1413 change. Subsequent attempts to use this file handle for rpcs then
1414 fail with the <code>ESTALE</code> error.  </p>
1415
1416 <p> The exercises below ask the reader to cause silly-renamed files, and
1417 stale file handles. </p>
1418
1419 SUBSECTION(«Performance Tuning»)
1420
1421 There are plenty of mount options for nfs. See <code>nfs(5)</code>
1422 for details. We only cover a couple of the more interesting ones with
1423 respect to performance.
1424
1425 <ul>
1426         <li> <code>soft/hard</code>. hard:  nfs requests are retried
1427         indefinitely. Soft: requests are failed eventually. </li>
1428
1429         <li> <code>sync/async</code>. async: nfs  client  delays  sending
1430         application writes to the server. sync: writes cause data to be
1431         flushed to the server before the system call returns. </li>
1432
1433 </ul>
1434
1435 EXERCISES()
1436
1437 <ul>
1438         <li> Run the following commands on an nfs client and discuss the
1439         output: <code>df -h</code>, <code>mount -t nfs -t nfs4</code>. </li>
1440
1441         <li> Run <code>/usr/sbin/rpcinfo -b 100003 3 | awk '{print $2}' |
1442         sort | uniq</code> to list all NFS servers. </li>
1443
1444         <li> Explain the difference between the <code>sync</code> mount option
1445         for nfs and the <code>sync</code> export option of nfsd. </li>
1446
1447         <li> Run the following commands on an nfs server and discuss the
1448         output: <code>/sbin/showmount -e</code>, <code>rpcinfo</code>. </li>
1449
1450         <li> On an nfs server, run <code>collectl -s F -i 5</code> and discuss
1451         the output. </li>
1452
1453         <li> In an nfs-mounted directory (nfs version 4.0 or earlier), run
1454         <code>cat > foo &</code>. Note that the cat process automatically
1455         receives the STOP signal. Run <code>rm foo; ls -ltra</code>. Read
1456         section D2 of the <a href="https://nfs.sourceforge.net/">nfs HOWTO</a>
1457         for the explanation. </li>
1458
1459         <li> In an nfs-mounted directory, run <code>{ while :; do echo; sleep
1460         1; done; } > baz &</code>. What happens if you remove the file on a
1461         <em>different</em> nfs client? </li>
1462
1463         <li> Discuss the pros and cons of hard vs. soft mounts. </li>
1464
1465         <li> Read section A10 of the <a href="https://nfs.sourceforge.net/">nfs
1466         HOWTO</a> to learn about common reasons for stale nfs handles. </li>
1467
1468         <li> Can every local filesystem be exported via nfs? </li>
1469
1470         <li> What's that readdirplus thing? Describe a scenario where it is
1471         beneficial and another scenario where it hurts performance. </li>
1472 </ul>
1473
1474 HOMEWORK(«
1475
1476 For each POSIX lock type, discuss whether file locks of this type
1477 work on an nfs-mounted filesystem. If they do, discuss the relevant
1478 mount options that are necessary to make them work.
1479
1480 »)
1481
1482 HOMEWORK(«
1483
1484 <ul>
1485         <li> Use the simplest interface that rpcgen has to offer to write a
1486         protocol in rpcl which allows a client to retrieve the load value of
1487         the server as a string (contents of <code>/proc/loadavg</code>).  </li>
1488
1489         <li> Write the same program, but this time use the expert interface
1490         and and pass each value of <code>/proc/loadavg</code> as a number of
1491         suitable type. </li>
1492 </ul>
1493
1494 »)
1495
1496 HOMEWORK(«
1497
1498 Describe the purpose of the <code>nfsd</code> and the
1499 <code>rpc_pipefs</code> pseudo filesystems. Hint: See
1500 <code>Documentation/filesystems/nfs/nfs.txt</code> of the Linux
1501 source code.
1502
1503 »)
1504
1505 HOMEWORK(«
1506
1507 Provide an overview of nfs version 4.1 (Parallel nfs).
1508
1509 »)
1510
1511 SUPPLEMENTS()
1512
1513 SUBSECTION(«Broken Rename»)
1514
1515 <pre>
1516         fd = open("foo.new", ...);
1517         write(fd, "new content", ...);
1518         close(fd);
1519         rename("foo.new", "foo");
1520 </pre>
1521
1522 SECTION(«Further Reading»)
1523 <ul>
1524         <li> <code>Documentation/filesystems/vfs.txt</code> of the Linux
1525         kernel source. </li>
1526         <li> Jonathan Corbet: <a href="https://lwn.net/Articles/419811/">
1527         Dcache scalability and RCU-walk</a>. An LWN articile which explains
1528         the dcache in some more detail. </li>
1529         <li> Dominic Giampaolo: Practical File System Design </li>
1530         <li> Cormen </li>
1531         <li> Darrick Wong: XFS Filesystem Disk Structures </li>
1532         <li> Documentation/filesystems/path-lookup.rst </li>
1533         <li> rfc 5531: Remote Procedure Call Protocol, Version 2 (2009) </li>
1534         <li> Birell, A.D. and Nelson, B.J.: Implementing Remote Procedure Calls
1535         (1984) </li>
1536         <li> <a href="https://lwn.net/Articles/897917/">NFS: the early
1537         years</a> and <a href="https://lwn.net/Articles/898262/">NFS: the new
1538         millennium</a>, two articles on the design and history of NFS by Neil
1539         Brown. </li>
1540 </ul>