Tux3 Design Philosphy: Layout Optimization

Tue Aug 14 22:33:09 PDT 2012

Tux3 Design Philosphy

Layout Optimization

Tux3 aims to support two distinctly different kinds of block storage hardware well:

  1) Traditional rotating media disk drives
  2) Modern solid state flash memory drives

Rotating media seek time can adversely affect both read and write performance by more than
an order of magnitude it common situations, so considerable design effort must be devoted
to addressing this problem. While it is true that the day is fast approaching when most
computing devices will use solid state devices for mass storage, we are so far not there yet,
and even when that day does arrive, rotating media will still rule the world of archiving
and large system storage for many years to come. Tux3 will enter service in the middle of
the ongoing transition from rotating media to solid state storage and must work well with
both.

There are three major classes of storage layout optimization in Tux3:

  1) Optimizations aimed at rotating media

    - Blocks expected to be needed close together in time should be stored near each other
      Example: directory entry block containing references to inode table blocks, containing
      references to file data blocks. The inode table blocks should be clustered near the
      referencing directory entry block, and file data blocks should be laid out linearly
      not too far away.

    - A block containing a reference to another block that will be needed shortly after
      should be stored before the referenced block.
      Example: a btree index block should be stored at a slightly lower address than its
      child blocks. A data btree leaf block should be stored just below its referenced
      file extents.

    - Log rollup. Without its logging mechanism, Tux3's no-overwrite rule would require any
      change to the filesystem tree to be recursively propagated by copy-on-write all the
      way to the filesystem root. A single byte of changed data would require a large number
      of blocks to be written: all the tree index blocks on the path to the filesystem root,
      bitmap blocks to record the new positions at which the modified blocks are written,
      metadata blocks referencing the bitmap blocks, and (recursively) bitmap blocks to
      record the positions of the new bitmap metadata blocks. In total, this might amount
      to a single digit number of blocks, but it is still considerably more than the
      minimum we would like to achieve. For some loads consisting of a widely scattered
      small writes, the multiplier effect could be quite pronounced. In either case, a
      large amount of seeking will be required. These undesirable effects are mitigated
      by Tux3's central design feature: logging and log rollup. In general, the new
      position of rewritten data is recorded in dirty cache and in a log block, but not
      in the filesystem tree. So for example, writing a single block of a file requires
      the actual data block to be written, one entry to be made in a log block, and the
      position of the log block to be recorded in a metablock. A total of three block
      writes, compared to several times that if a recursive copy on write strategy
      were used. Because the Tux3 log is of limited size in order to limit potential
      replay time and to facility block freeing, from time to time Tux3 executes a
      rollup phase, a special delta commit where log dirty cache blocks are flushed to
      disk and log blocks are released. Even for log rollup, recursive copy on write all
      the way to the root of the filesystem is not required: a rollup phase can generate
      new log entries as convenient. Besides reducing total metadata writes, this reduces
      rotating media seeking by permitting a number of potentially widely separated
      metadata blocks to be written in a single, linear pass of the write head across
      the media.

  2) Optimizations aimed at solid state devices

    - Blocks expected to be freed at the same time should be kept close to each other in
      time and space

    - Flash can only be erased and rewritten a limited number of times before it wears
      out. Thefore it is important to minimize the total volume of data written,
      especially by minimizing metadata writes. For flash, it may be better to log less
      and directly update filesystem index blocks more, in order to minimize the extra
      write traffic required for log rollup.

  3) Optimizations valid for both

    - Compact metadata. Tux3's btree leaf nodes (inode table blocks for the itable btree
      and data extent leaf blocks for file data trees) are optimized for a tradeoff between
      compactness, low processing overhead and predictable space utilization. Compression
      per se is not used because of unpredictable space requirements, which would increase
      the difficulty of providing disk space availability guarantees. Rather, compact data
      data structures are used: block extents for file data and compact, variable sized
      records for inode attributes.

    - See log rollup above, to reduce total metadata writes.

    - In future, optional space saving optimizations may be added such as data and
      metadata compression, tail packing and data deduplication. For the time being,
      simplicity, reliability and predictable storage utilization are the main focus.

There are significant differences between the optimization strategy for solid state and
rotating media. Fortunately, in Tux3 these differences can be well localized in a layout
policy module, with runtime options to optimize for one or the other. Other than layout
policy, all design elements of Tux3 are applicable without change to both kinds of block
storage media.

Note that it is in general impossible to satisfy all layout constraints simultaneously,
particularly when combined with the requirement of writing new data only to blocks marked
free in the most recent committed filesystem image. In particular, the further from the
root of the filesystem tree, the harder it is to keep related blocks near each other.
We must allow clusters of related blocks to disperse themselves at relatively widely
separated locations, or in other words, layout goals for rotating media vary as we move
closer to the root of the filesystem tree.

We also must consider rewrite effects, both because of the need to write to unused locations
(before potentially freeing some of the nearby used locations) and because of the additional
load imposed by versioning, which may keep some rewritten file data from being truly freed
for an arbitrarily long time. There is no one perfect layout strategies to address all these
issues. The plan with Tux3 is to start with some simple rules to ameliorate some of the
worst layout problems, then improve the layout optimization strategy over time. During this
process we can reasonably expect rotating media to become steadily less important in terms
of performance, though never completely unimportant. In general, the operative principle is
that storage layout optimization never affects backward or forward compatibility, and so
may be improved over time without risk of rendering existing data inaccessible.

Solid state media is sensitive to "write multiplication" which manifests as a volume
becomes nearly full. The issue is that, in order to recover space on a relatively large
flash "erase page" on which only a small number of blocks are actually free, all the
other blocks need to be read and rewritten to other free locations so that the entire
erase block becomes erasable. This behavior is hidden behind a "flash translation layer"
over which a filesystem such as Tux3 has little control, yet Tux3 must exercise what
control it has by trying to write blocks that are likely to be freed at the same time
as close together as possible both in space and time.

One powerful tool Tux3 will have available is block migration. Tux3 supports storing any
filesystem block anywhere on a volume, and because each data block (or extent) is singly
referenced, it is relatively straightforward to migrate poorly position blocks to more
optimal locations as the need arised. Additionally, Tux3's well defined cache semantics
and its logging facility facilitate such migration safely under active runtime loads.