From d.phillips at partner.samsung.com Wed Mar 5 23:47:59 2014
From: d.phillips at partner.samsung.com (Daniel Phillips)
Date: Wed, 05 Mar 2014 23:47:59 -0800
Subject: Design note: Allocation Heuristics
Message-ID: <5318282F.6070102@partner.samsung.com>

Until now, Tux3 has relied on a simple, linear allocation strategy where we always search for free blocks just after the last one allocated. This works wonderfully well with an empty volume, but is prone to fragmentation over time as deletes open up holes that tend to be filled in by unrelated blocks. A new patch set is now entering testing that attempts to provide good long term fragmentation resistance by implementing techniques analogous to those that have worked well for the Ext filesystems over the years. As with Ext, we hope to control fragmentation to the point where the services of a defragmentation utility are seldom if ever required.

Static versus adaptive allocation policy

I expect this work to proceed in two major stages, the first of which is the current patch set. It implements a basic "static" allocation policy: we establish an allocation goal for each inode at the time the inode is created, and that goal persists over the life of the object. Later, we will introduce "adaptive" behavior, where inode allocation goals may change over time in response to observed allocation patterns.

Hopefully, a simple static allocation policy will already work well enough for Tux3 to be usable as a general purpose filesystem. After all, the Ext filesystems get by very well with something not much different. But clearly, even with good initial guesses, we may want to change our minds at some point about the allocation goal for a given inode in response to changing volume congestion patterns. To support adaptive allocation, we will introduce a "goal" inode attribute that overrides the default "inode number equals allocation goal" rule when congestion or other undesirable allocation conditions are detected. For forward compatibility, we will introduce the goal attribute before freezing the Tux3 layout definition, whether or not we are ready to use it.

The rest of this note concerns the new, static allocation heuristics, which I hope will perform fairly well.

Inode number to block address correspondence

The big simplifying idea of the current "static" allocation policy is that the block allocation goal corresponds roughly to the inode number. This accomplishes two things: 1) it gives a stable goal for each inode, so that if a file is rewritten, the new blocks will tend to be allocated fairly close to the original ones, and 2) it makes the ordering of allocated blocks similar to the ordering of the inodes that own them, so a linear walk through the inode table is also a roughly linear walk through the allocated blocks. This should reduce seeking on spinning media. For flash, it should help with erase block behavior.

Imposing the inode number to block address correspondence reduces the volume layout problem to making good choices for inode numbers. However, it is impossible in general to make a perfect inode number choice, because at creation time we do not know very much about how big a file will be, how many files there will be in a directory, or which parts of the volume will be congested or sparsely used in the future. Furthermore, all of these factors can vary through large ranges over time. So we content ourselves with some assumptions: most files will be small, and most directories will have a modest number of files.
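In rough code, with hypothetical names, the rule is simply this (folding is reduced to a plain modulo here; the real power of two scheme appears under "Folding inode numbers to block addresses" below):

    /* Minimal sketch of the static policy: an inode's default block
     * allocation goal is derived from its inode number alone, so the
     * goal stays stable across rewrites and inode table order roughly
     * matches block order.  Folding is reduced to a plain modulo here;
     * the real power of two scheme is described later in this note. */
    typedef unsigned long long u64;

    u64 default_goal(u64 inum, u64 volblocks)
    {
        return inum % volblocks;    /* stands in for the real fold() */
    }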
Some immediate weaknesses are apparent. Two large files with nearby inode numbers will have similar block allocation goals. If these files grow slowly by appending, like log files, it is likely that the two will end up intertwined, potentially causing extra seeking. If there are many large files in the same directory, or a very large number of files in a directory, then there will be congestion in the neighborhood of that directory, and lengthy linear searches might be needed for new allocations. For the time being, we will just accept these weaknesses and consider what to do about them if we observe issues in practice. Eventually, adaptive allocation and other techniques will help.

Block goal extrapolation

For each file write, we extrapolate an allocation goal, which is simply the logical offset of the write plus the inode number, with "folding" as described below. Thus, a write to the same logical address always gives the same physical goal, whether the write is sequential or random.

Tux3 never overwrites data, but always writes new data into free space. How does this "copy-on-write" interact with the goal extrapolation rule? If no free space is available nearby (common when rewriting a large file) then the new blocks end up far away from the original. However, if the file is rewritten again, then chances are the out of place blocks will return to their "natural" position. If a very large file is rewritten so that the write is broken across multiple deltas, the behavior becomes more interesting. The first delta will be allocated far out of line; however, it leaves a gap into which the next delta can be written. After that process covers the entire file, the entire file has moved lower, with just one region allocated far out of line. It seems as though copy-on-write effects should cause only a modest increase in fragmentation, as long as some "slack space" is distributed throughout the volume. The actual effect remains to be determined experimentally. This is one case where the ability to overwrite in place, as Ext4 can, could be an efficiency advantage.

Low level segment allocator

The purpose of allocation policy is only to provide a starting point (goal) for a free space search. The low level segment allocator then provides its own, important semantics: it guarantees to search the entire volume if necessary, so that all free space on the volume can actually be used. The low level allocator searches in two passes, first considering only allocation groups with enough space to satisfy all or a large part of the request, then searching all remaining allocation groups with any free space at all. Thus, there is some fragmentation avoidance built into the low level allocator. The low level allocator uses the count map table to skip over full or nearly full allocation groups efficiently. So if allocation policy does fail to provide a good allocation goal due to congestion, the result may be a long but fairly efficient search. A secondary result is potentially increased fragmentation. Our objective for this new allocation work can therefore be understood entirely in terms of reducing the CPU cost of searching and reducing fragmentation.
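In rough code, with hypothetical names and the count map reduced to a per group free block count ("a large part" is arbitrarily taken as half the request here), the two pass search looks something like this:

    /* Hypothetical sketch of the two pass group search.  groupfree[]
     * stands in for the count map: free blocks per allocation group.
     * Pass 0 considers only groups that could satisfy all or a large
     * part of the request (arbitrarily, half); pass 1 accepts any
     * group with free space at all, so every free block on the volume
     * remains reachable. */
    typedef unsigned long long u64;

    long find_group(const u64 *groupfree, long groups, long goalgroup, u64 want)
    {
        for (int pass = 0; pass < 2; pass++) {
            u64 enough = pass ? 1 : (want + 1) / 2;
            for (long i = 0; i < groups; i++) {
                long group = (goalgroup + i) % groups;
                if (groupfree[group] >= enough)
                    return group;    /* run the extent search in this group */
            }
        }
        return -1;    /* no free space anywhere: volume is full */
    }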
Directory entry position to inode number correspondence

Mass file operations will often be performed in directory entry order, so it is helpful to establish a correspondence between directory entry order and inode number. With linear allocation, this happens naturally when creating files in an empty volume, but breaks down when creating new files in an existing directory, especially if holes exist in the directory where files have been deleted. To improve behavior for such randomly created files, we let the position of a new directory entry guide the choice of inode number. We use a simple extrapolation of the directory entry's logical address, with the inode number of the parent directory as the base. This should produce good results whether the directory is extended by a new entry or a previously deleted entry in the interior of the directory is reused.

There are some limitations. If an inode is hard linked or moved to a new directory, its inode number will be out of line with the others in the destination directory. We also have an unnatural dependence of the inode number choice on the average length of names in a directory: longer names decrease the density of the inode table. This is only a minor annoyance, because Tux3 inodes are variable sized and leaving gaps between them does not cost much. In theory, we could improve this behavior by computing a statistic for average entry name length to use in the extrapolation calculation.

Directory positioning

As for all inodes, the allocation goal for a directory is simply its (possibly folded) inode number, which is also the base position for extrapolating all the inodes contained within the directory. Directories are extrapolated more widely than files, with the intention of leaving space between directories for the files they contain. In a nearly empty volume, we spread directories out more widely still, on the assumption that they will contain relatively more files and subdirectories.

We also try to create directories in relatively empty allocation groups. The threshold for emptiness is determined by the fullness of the volume and by how deep the directory is in the filesystem tree. A directory created at root level in a nearly empty volume will be created only in a completely empty group, if one is available. The threshold is relaxed as the volume becomes more full, and for directories deeper in the tree. The search for a relatively empty group is fairly short. If no relatively empty group is found, then a linear search for a free inode number is performed, starting at the extrapolated directory inode number. For the time being, the directory group search is strictly linear; it is probable that a pseudorandom search using the extrapolated goal as a key would produce better results, a possible future improvement.

Directory extrapolation is clearly a flawed mechanism. It produces congestion when the extrapolated positions of directories with different parents overlap. It also assumes a limited number of files per directory - files beyond that will overflow into the regions of other directories. For the moment, our narrow goal is to improve on the situation where directories are not spread out at all, and therefore maximally vulnerable to age related fragmentation.

Use linear allocation as much as possible

Simple linear allocation turns out to be the best allocation policy in a surprisingly wide variety of cases, and the new patch set does not abandon it entirely. If files are being written in sequence within a delta, then we use linear allocation instead of extrapolating from the inode number and write offset. However, if the linear goal diverges too much from the extrapolated goal, then we switch to the extrapolated goal and continue linearly from there.
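In rough code, with hypothetical names (the "too far" threshold is defined next; the constant below is only a placeholder for it):

    /* Hypothetical sketch: prefer the linear goal, but snap back to
     * the extrapolated goal once the two diverge "too far".  TOO_FAR
     * is a placeholder constant; its intended magnitude is discussed
     * just below. */
    typedef unsigned long long u64;

    #define TOO_FAR 2048    /* blocks; placeholder, roughly cylinder sized */

    u64 choose_goal(u64 linear_goal, u64 extrap_goal)
    {
        u64 gap = linear_goal > extrap_goal ?
            linear_goal - extrap_goal : extrap_goal - linear_goal;
        return gap > TOO_FAR ? extrap_goal : linear_goal;
    }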
We define "too far" as something on the order of a hard disk cylinder, in an effort to avoid introducing gaps of more than a track. (Within a track, gaps and out of order allocation matter relatively little because of the track cache.) The "too far" rule should also help with multitasking loads: if two tasks are writing in different directories simultaneously, it should prevent those writes from mixing their allocated blocks together.

With linear allocation, if we untar a big directory tree, subdirectories and their contents will always be allocated entirely inside their parent directories. This is optimal if we read in the same order, and reasonably good even for breadth first access. However, this arrangement will not age well, because there is no slack space for creating new files close to other files in the same directory. The new patch set departs from this "embedded child" behavior and always creates directories outside their parent directories. This adds two seeks per directory to a depth first traversal. What we hope to gain in return is much better aging behavior, a result that remains to be confirmed experimentally.

Folding inode numbers to block addresses

Both inode numbers and block addresses are 48 bits, so at the theoretical maximum volume size of one exabyte, we have one inode per volume block, which should be adequate. For smaller volumes, we generate a volume address for any inode number by "folding" the inode number to lie within the volume address range. The details of the folding algorithm are somewhat interesting. We take the inode number modulo a volume "wrap", which is the smallest power of two greater than the volume size, and fold again to half the wrap size if necessary, yielding a block address strictly inside the volume.

This algorithm gives an inode number to volume address mapping that behaves well even if the volume is resized. If the volume size is reduced, nearby block allocation goals will still be nearby. If the volume size is increased, then an inode may "unfold" to a new allocation goal, and its blocks will eventually migrate to the new goal when rewritten. Even under many volume size changes, the blocks of a given file will only take on a few different allocation goals, so fragmentation is controlled.
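In rough code, with hypothetical names and volume size measured in blocks:

    /* Hypothetical sketch of folding.  wrap is the smallest power of
     * two greater than the volume size in blocks; an address that
     * falls beyond the end of the volume is folded down by half the
     * wrap, which always lands strictly inside the volume. */
    typedef unsigned long long u64;

    u64 fold(u64 inum, u64 volblocks)
    {
        u64 wrap = 1;
        while (wrap <= volblocks)
            wrap <<= 1;                 /* smallest power of two > volblocks */
        u64 addr = inum & (wrap - 1);   /* inum modulo wrap */
        if (addr >= volblocks)
            addr -= wrap >> 1;          /* fold again, to half the wrap */
        return addr;
    }

For example, with a five block volume the wrap is eight, so inode number 6 first maps to address 6, which lies beyond the volume and folds down to 6 - 4 = 2.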
Summary

In summary, the recent patches introduce the following major changes:

 * Inode number implies allocation goal
 * Extrapolate inode numbers from directory entry position
 * Create new directories in empty groups while the volume is young
 * Low level extent search is now driven by group counts

With these changes, we expect a modest regression in the form of increased fragmentation on some loads for which linear allocation is optimal, in return for a large improvement in the form of reduced fragmentation as a volume ages. How much regression, and how much improvement, remain to be determined experimentally. The current set of heuristics might need to be adjusted and can certainly be improved over time. At some point we expect to move on from static allocation goal determination to adaptive. That said, it is possible that the current set of heuristics will already perform competitively, to the point where allocation issues are no longer a barrier to using Tux3.

Regards,

Daniel