On 13.12.2010 16:13, Hyrum K. Wright wrote:
> On Sun, Dec 12, 2010 at 9:23 AM,<stefan2_at_apache.org> wrote:
>> Author: stefan2
>> Date: Sun Dec 12 15:23:18 2010
>> New Revision: 1044833
>>
>> URL: http://svn.apache.org/viewvc?rev=1044833&view=rev
>> Log:
>> Get some ideas for FSFS format improvements written down.
>>
>> * notes/fsfs-improvements.txt: new file
>>
>> Added:
>> subversion/trunk/notes/fsfs-improvements.txt (with props)
>>
>> Added: subversion/trunk/notes/fsfs-improvements.txt
>> URL: http://svn.apache.org/viewvc/subversion/trunk/notes/fsfs-improvements.txt?rev=1044833&view=auto
>> ==============================================================================
>> --- subversion/trunk/notes/fsfs-improvements.txt (added)
>> +++ subversion/trunk/notes/fsfs-improvements.txt Sun Dec 12 15:23:18 2010
>> @@ -0,0 +1,224 @@
>> +Introduction
>> +------------
>> +
>> +The FSFS external data format can be changed to allow for
>> +significantly reduced I/O overhead without changing the
>> +fundamental ideas behind FSFS.
>> +
>> +The whole idea is to rearrange packed revision info in a
>> +new FSFS format "6". A two-way conversion between format
>> +"5" and "6" should be possible as well as supporting both
>> +formats for different repositories on the same server.
> Generally, I like these ideas.
>
> As I understand it, they would be implemented as part of the packing
> process, or the pack files would be rewritten as part of the repos
> format upgrade.
Yes.
> This could get a bit tricky, since FSFS depends on
> static file offsets for referencing data within revision files. Not
> impossible, just tricky.
Certainly not straightforward but not too bad either:
the size of all fragments is known / can be known in
advance without difficulty. As long as we allow the
packing process to keep a full in-memory table of
all relevant fragments, it is not too hard to calculate
the final offsets once the total fragment order has
been defined.
The only caveat is the fact that the references themselves
modify the targeted offset since they use a variable-
size encoding. But that can be dealt with as well.
>> +
>> +
>> +Revision Order
>> +--------------
>> +
>> +FSFS format "5" packs revisions by putting revision 0 at
>> +the beginning of the file and concatenating the others in
>> +ascending order. This intuitive design is counter-productive
>> +because we always trace data HEAD -> r0 and scanning a file
>> +backwards is expensive.
>> +
>> ++-------+
>> +| rev 0 |
>> ++-------+
>> +| rev 1 |
>> ++-------+
>> +
>> +: ... :
>> +
>> ++-------+
>> +| rev N |
>> ++-------+
>> +<EOF>
>> +
>> +A counter-intuitive but more efficient reversed order (de-
>> +scending revision order in a packed file) results in always
>> +reading forward through a file.
>> +
>> ++-------+
>> +| rev N |
>> ++-------+
>> +
>> +: ... :
>> +
>> ++-------+
>> +| rev 1 |
>> ++-------+
>> +| rev 0 |
>> ++-------+
>> +<EOF>
>> +
>> +
>> +Don't Interleave Revision and Content Data
>> +------------------------------------------
>> +
>> +Format "5" keeps the each revision as a single block. When
>> +re-constructing a node content from the diffs, we jump
>> +through a number of distant revision blocks. For each node.
>> +
>> ++--------------+
>> +| rev N Reps |
>> ++--------------+
>> +| rev N Header |
>> ++--------------+
>> +
>> +: ... :
>> +
>> ++--------------+
>> +| rev 1 Reps |
>> ++--------------+
>> +| rev 1 Header |
>> ++--------------+
>> +| rev 0 Reps |
>> ++--------------+
>> +| rev 0 Header |
>> ++--------------+
>> +<EOF>
>> +
>> +Instead, delta-info should be removed from the revision
>> +blocks and put at the end of the file. This speeds up
>> +header-only operations like "log" without impairing node
>> +content lookup. And it lays the foundation for further
>> +improvements.
>> +
>> ++--------------+
>> +| rev N Header |
>> ++--------------+
>> +
>> +: ... :
>> +
>> ++--------------+
>> +| rev 1 Header |
>> ++--------------+
>> +| rev 0 Header |
>> ++--------------+
>> +| rev N Reps |
>> ++--------------+
>> +
>> +: ... :
>> +
>> ++--------------+
>> +| rev 1 Reps |
>> ++--------------+
>> +| rev 0 Reps |
>> ++--------------+
>> +<EOF>
>> +
>> +The only downside is that putting headers first is somewhat
>> +more computationally expensive since offsets needs to be
>> +calculated in advance. This is made even more difficult by
>> +using a variable length encoding for those offsets.
> Does the statement about computation cost apply to the initial placement,
> or subsequent accesses? (I would guess the former.)
Only during the initial placement (i.e. packing process).
Everything else should remain "references to somewhere
in the file". We may even stick to offsets relative to the
revision header start (allows for using the same code as
for format "5" repos) instead of switching to absolute
positions.
>> +
>> +
>> +Group the Representations by Delta Tree
>> +---------------------------------------
>> +
>> +All diffs that are based on the same node within the packed
>> +revision file are put in one place. Re-constructing a node
>> +would then involve reading the revision headers (relatively
>> +close together) and the content deltas after that (again
>> +with high locality).
>> +
>> ++------------------+
>> +| rev N Header |
>> ++------------------+
>> +
>> +: ... :
>> +
>> ++------------------+
>> +| rev 1 Header |
>> ++------------------+
>> +| rev 0 Header |
>> ++------------------+
>> +| node tree A Reps |
>> ++------------------+
>> +| node tree B Reps |
>> ++------------------+
>> +
>> +: ... :
>> +
>> ++------------------+
>> +| node tree Z Reps |
>> ++------------------+
>> +<EOF>
>> +
>> +We can optimize that even further by exploiting the skip-
>> +delta information: the "walks" through the delta tree for a
>> +given node will merge bit by bit into a main "trunk". The
>> +nodes on these paths can be re-ordered such that very few
>> +seek() operations are required on average to reconstruct
>> +the node content in this file.
>> +
>> +Example of 8 changes in revs r0 .. r7 (for simplicity),
>> +looking at the delta-info only ("->" means "stored as diff
>> +against"):
>> +
>> + r0
>> + r1->r0
>> + r2->r0
>> + r3->r2
>> + r4->r0
>> + r5->r4
>> + r6->r4
>> + r7->r6
>> +
>> +Default ordering r7,r6,r5,r4,r3,r2,r1,r0 requires
>> +3/3/2/2/2/2/1/1 seeks. For 2^N changes (N integer> 0),
>> +we need (N+1)/2 seeks on average.
>> +
>> +A path-optimized ordering r1,r3,r2,r5,r7,r6,r4,r0
>> +requires 2/2/2/2/1/1/1/1 seeks. It's (N+3)/4 on average.
>> +That optimized ordering can easily be constructed by
>> +
>> + * select the highest rev not covered, yet
>> + * prepend its diff path to the existing result
>> + (until we meet a revision at already is in
>> + in the result)
>> + * repeat until all revs are covered
>> +
>> +
>> +Move the Manifest Info into the Packed File
>> +-------------------------------------------
>> +
>> +Once we mastered the offset pre-calculation issue (see
>> +above), we may as well put the manifest info at the be-
>> +ginning of the file. This will benefit "log" in particular
>> +as only one consecutive chunk of data needs to be read
>> +from only one file.
> I don't completely understand the benefits we get from this, since the
> manifest is read one and then cached. I would think that the accesses
> into the pack file would dominate I/O, and the placement of the
> manifest wouldn't matter much (though I could certainly be wrong).
In many cases, you are right. However
* there is no real benefit in keeping 2 separate files
(or maybe there is? In that case, we should keep
them separate.)
* Opening files is very expensive compared to reading
data from an open file -> keep number of fopen() call low
* Operations like "log" on a single file may crawl many
packed revision files that are otherwise rarely accessed
>> +
>> ++------------------+
>> +| rev 0 Offset |
>> ++------------------+
>> +| rev 1 Offset |
>> ++------------------+
>> +
>> +: ... :
>> +
>> ++------------------+
>> +| rev N Offset |
>> ++------------------+
>> +| rev N Header |
>> ++------------------+
>> +
>> +: ... :
>> +
>> ++------------------+
>> +| rev 1 Header |
>> ++------------------+
>> +| rev 0 Header |
>> ++------------------+
>> +| node tree A Reps |
>> ++------------------+
>> +| node tree B Reps |
>> ++------------------+
>> +
>> +: ... :
>> +
>> ++------------------+
>> +| node tree Z Reps |
>> ++------------------+
>> +<EOF>
>> +
>> +
>>
>> Propchange: subversion/trunk/notes/fsfs-improvements.txt
>> ------------------------------------------------------------------------------
>> svn:eol-style = native
>>
>>
>>
> Overall, it looks like a good plan. Thanks!
>
Thanks for the review!
-- Stefan^2.
Received on 2010-12-15 02:46:10 CET