Design note: Data compression (draft)
daniel.raymond.phillips at gmail.com
Sat Jul 27 02:30:42 PDT 2013
This note examines design and implementation issues of a transparent
data compression feature for Tux3. First, we look at the various parts of
the code base that will be affected, with a brief description of the changes
or additions required. Then design details are considered, including
rationale for the design decisions in terms of efficiency and implementation
We contemplate this work being done in two or three phases. Initially,
compressed data will be stored with no attempt to pack compressed
extent tail fragments efficiently. Expected compression ratio would be
good but not stellar. A second implementation phase would introduce
a space-efficient means of storing compressed tail fragments. A final
clean up phase would remove some restrictions adopted to ease
initial implementation, such as "compression stride", with a further
modest improvement in compression ratio expected.
Overview of Changes
mount options - enable/disable transparent compression globally. Specify
compression method and possibly compression control parameters.
ioctl interface - enable/disable compression for a given file and
possibly specify compression parameters (alternatively, rely on global
mount option parameters). Possibly set/clear a directory "sticky" bit so
that new files created in the directory have compression on/off.
metadata - compressed data size is generally not the same as logical data
size and is measured in bytes as opposed to blocks. Metadata format
(dleaf) must at minimum record the number of physical blocks used by the
extent, which is currently recorded implicitly as the difference between
logical addresses of successive dleaf entries. Only sufficient metadata
needs to be available in the dleaf to allow a compressed block to be loaded.
Metadata not needed to load a compressed extent, such as compression
method and parameters, is recorded in the header of the compressed block
map_region - maps cache page to media data for read or write. Updates
metadata for the latter. Normally, data may be transferred directly between
media and cache pages, but when compression/decompression is required
then the data must be transformed, so the transfer becomes a multi
read path - the traditional block IO library scheme for buffered reads
is not well suited to the multi stage transfer required for decompression.
Rather than work around this we will reimplement that functionality inside
Tux3 in a form suited to decompression, and possibly also better suited to
its traditional task. It is likely that implementing a new cache read scheme
will be of comparable complexity to working around deficiencies of the
traditional scheme, while delivering a superior result.
Our existing metadata formart requires only minor elaboration to accommodate
compression. A dleaf itself only needs to specify how to load a compressed
extent, that is, its location and length. Only the latter is new, because
compressed block count is potentially different from logical count.
Additional details needed to perform the actual compression, such as
compression algorithm, can be stored in the compressed extent itself. So we
need just one new field per dleaf extent entry, the physical extent block
The physical count can be limited to a fairly small number because
there is not much advantage in compressing very long regions in terms of
compression efficiency, and a distinct disadvantage in for random IO. So
8 bits for the compressed block count should be sufficient. Currently, we
provide a 32 bit "version" field in each entry of the dleaf2 format, which is
more precision than necessary. We can reasonably reduce that to 24 bits to
accommodate the new compressed extent count field.
Writing a compressed extent that logically overlaps an existing
compressed extent would require reading the original and rewriting the
entire affected region, either shortening the original or the new extent,
or merging them. Multiple compressed extents could possibly be overlapped.
If uncompressed extents are overlapped by a new compressed extent the
uncompressed extent does not not to be read into cache but can be
shortened just by updating dleaf2 entries.
The complexity of handling overlap on compressed extent write is
considerable, therefore it is expedient to plan an initial implementation
that prevents overlapping compressed writes by adopting a fixed logical
"compression stride". A compression stride of 16 would mean always compress
16 block regions aligned on 16 block boundaries. To save a modest amount of
CPU on arithmetic, the compression stride may be restricted to a power of
Compression tends to improve with the size of input data available. For
typical sliding window schemes like libzip, window size might be up to 64K.
Therefore, a very small compression stride would reduce compression
efficiency. (How much is an open question, which we should research.)
On the other hand, an overly large stride would harm random rewrite
performance. Where the optimum lies is an good question, to which there may
be no fixed answer. However there would appear to be a range of possible
values with both good compression ratio and good random rewrite
performance. For the time being, our default compression stride will be 16
To provide a limited form of forward compatibility, only writes should be
restricted by compression stride. Reads should allow unrestricted extent
alignment. Once chosen for a given file, the compression stride can be made
smaller only at the cost of reintroducing the possibility of write overlap.
Compressed extent tail fragments
With short compressed extents, wastage in the compressed tail fragment
becomes relatively large. If the average size of a compressed extent is
three blocks then wastage in the tail fragment will be roughly 14%, rather
large when the principal objective is to save space. This could be
amortized across a larger compressed extent size, with possible negative
impact on random read/write efficiency, however this will not help with
We therefore desire a means of packing tail fragments together with other
filesystem data or metadata to eliminate this wastage, while hopefully
not greatly reducing read or write efficiency greatly.
The filesystem wastage in the absence of packing could range from zero to
considerably more than 50% in extreme cases, with an expected benefit
in the range of 10%. In compression terms, this is significant, however not
so significant that we are compelled to implement such an optimization
immediately. We therefore regard tail fragment packing as a follow on
project from the initial compression work in the sense that a respectable
compression ratio is expected for typical usage even while wasting
considerable slack space in tail fragements. Nonetheless, we should have a
plausible design for tail packing now to serve as a reference for other
developement work, to facilite this follow on work.
To avoid false sharing and its negative effect on write performance, tail
fragments should be packed together with other data that is functionally or
temporally related in the sense that it tends to be needed in cache at the
same time, and has a similar lifetime on media.
The current Tux3 design and implementation already supports variable number
of variable length attributes in inodes, which looks attractive as a means
of storing compressed tail fragments. The lifetime of a tail fragment
tends to be similar to the lifetime of the inode, and the problem of false
sharing with other files is avoided.
Some difficulties arise. The size of a tail fragment should be limited to
some fraction of the block size in order to avoid introducing harmful
corner cases in btree splitting. However, a tail fragment can be as big as a
block, less a byte. It therefore seems clear that we need to support
storing multiple fragment attributes per tail fragment. We would store the
logical, sub-block offset of each tail fragment in the attribute header.
There may be multiple compressed extents with tail fragments, which must
be uniquely indexed. The physical block address of the compressed extent
would suffice for this purpose. So a tail fragment attribute header would
contain, in addition to attribute type and version:
* physical block address of compressed extent (lookup key)
* fragment offset within tail block
* fragment length
The fragment data follows immediately.
During the usual attribute scan at inode open time, fragments will be
reassembled and cached until demanded by read access to a non-present cache
block backed by the associated compressed extent. The blocks of the
compressed extent less the tail fragment will be read into an intermediate
buffer (the volmap cache may serve for this purposed. Then the compressed
extent including the reassembled tail fragment will be decompressed into
the file page cache.
A weakness of this scheme is that an inode that includes many compressed
extent tail fragments may become inconveniently large, spanning many inode
tree blocks and increasing inode load and save times. A simple resolution
is to limit the number of tail fragments stored, then fall back to storing
tails together with the rest of the compressed extent, just the case
where no tail fragment packing scheme is available, with consequently
degraded compression ratio.
As a more elaborate but superior approach, a new kind of fixed length
attribute, a "fragment tree", could be an extent tree, similar to a dtree,
where fragments are stored end to end in each referenced extent. This would
avoid expending work on tail fragments until actually demanded, and would
not require the entire structure to be rewritten if only one of the extents
is affected by an update.
It is apparent that improving compression ratio modestly by storing tail
fragments compactly requires nontrivial implementation work. However, the
gains will most probably justify the effort.
With compression, data no longer moves directly from cache to media,
instead it is compressed or decompressed using an intermediate buffer, then
sent to its final destination. Compression also changes the size of data,
breaking the normal one-to-one correspondence between cache blocks and media
By way of review, map_region is responsible for mapping logically mapped
file cache IO requests to media blocks. Map_region itself does not perform
data IO, but may read metadata as necessary into volume cache. Map_region
takes a file cache region as input and returns some number of physical
media extents as a result. For initial write or nondestructive rewrite,
these extents are newly allocated from volume free space, while for read or
destructive rewrite, existing extents keyed by logical file address are
retrieved from cached file metadata. For write, extent metadata is added
for newly allocated extents and preexisting extents are resized or removed
as necessary. For read, cache blocks not covered by existing extent
mappings are reported as holes.
Transparent compression imposes additional requirements on map_region.
Unlike normal media extents, compressed extents are indivisible in
the sense that an extent must be available in its entirety in order to
decompress it. For write, any partially overwritten compressed extents must
be read into cache and partially recompressed, which introduces a new
requirement for read before write.
Compression is done at delta marshal time, for all dirty inodes in a delta.
Memory lifetime of compressed data is from marshal to delta complete.
Therefore a common, expandable buffer may be used to hold compressed extent
blocks. Each compressed write bio references this intermediate buffer rather
than the original file cache pages. Read before write is required in the
case of rewrite within an existing compressed extent.
Under high fragmentation conditions, a single compressed extent may be
split across multiple physical extents. We may ignore this possibility
in our initial implementation, and accept the risk of allocation
failure under rare fragmentation conditions. Eventually, multiple physical
extents per compressed extents should be supported in order to guarantee
full utilization of a fragmented volume.
When multiple physical extents per compressed extent are allowed,
map_region needs to know which physical extents belong to a given
compressed extent. No new metadata is strictly required for this: the
logical length of all but the last of the dleaf entries for the extent will
be zero (in dleaf2 format, the logical offsets are equal). However, it may
be expedient to explicitly mark "continuation" extents in order to
facilitate finding the first extent.
we already replace the core kernel dirty inode flushing mechanism with our
own because we cannot tolerate core's current behavior of flushing inodes
effectively randomly without reference to our consistency requirements, but
still use core's multi-page buffer read scheme, which interfaces to a
filesystem via the ->get_block hook. However, this traditional API is
inadequate for transparent compression because of the assuption that we
want the mapped data to be read into into file cache. In fact, we must read
the data into some temporary buffer before decompressing into file cache.
Unlike the write case, cache reads take place as frontend activity at
unpredictable times. We might allocate and free such temporary read buffers
on demand per compressed read, or we might operate our own pool, if that is
A large read might cover many compressed extents. A simple, synchronous
read, decompress, copy loop would block the CPU on IO wait when it could
be doing decompression work or resolving other read mappings. Instead,
our read mechanism should be content just to submit all required reads,
with decompression and cache loading work being scheduled by read
completions and done in a worker thread.
<to be continued>
More information about the Tux3