Re: Fwd: Subversion Compressed Pristines Design
From: Ashod Nakashian <ashodnakashian_at_yahoo.com>
Date: Thu, 22 Mar 2012 01:44:42 -0700 (PDT)
I've combined issues from separate emails and removed the comments from the Google Docs draft to make things concise. Converted to plain text for convenience.
Let me clarify. There is a trade-off between fixed-size and variable-size structures. The former is easier/faster to read, because you don't need to do any parsing. The latter is potentially more compact on disk, but requires parsing and/or lookup tables (git uses variable-size entries with lookup tables). My original design was 40 bytes/entry, due to the file size being 64 bits and the pack ID being 32 bits. The reason I say a shorter entry size is faster is that I'm thinking in terms of reading the complete index file (or a large part of it): 5 bytes x 1M files = 5MB of data. See below for why we might need to read the full index.
Having said that, let's suspend this path and consider your next point, which looks more promising. I like the idea of having the index entries in sqlite, *provided it's not significantly slower*.
It's true that we've already read the row, and getting the relevant information in that read is probably faster than reading from a separate file on disk. I'd completely agree to moving the index file into wc.db if we avoid any operations that span *all* entries. You see, if for some operation we need a map of all pack store files and their contents (for fast lookup, or to find the best-fit pack store for a new/modified file) then we need one of 3 methods:
1) Read all pristine rows from wc.db and construct the necessary lookup table(s).
My approach was #3, and as such I made the entries fixed-size and small (for fast reads, a small memory footprint, a high cache-hit rate, etc.). We can make the file size a full 64 bits and still have better performance than with the first 2 approaches.
Seems I must change the wording in the doc. The 16MB is the cutoff size at which we split a pack and start writing into another one. There is no reason a pack file can't grow to TBs in size while we still have 64k of them; but there must be a cutoff size, otherwise splitting would never occur. The algorithm basically splits packs until it reaches a maximum number of splits; at that point (where we've deemed more pack files counterproductive) the existing pack files have to grow instead. I've tentatively chosen 16MB as the cutoff and 64k splits as "good starting points", and added that benchmarking will ultimately decide these values (although I think 64k files is a nice limit to try to preserve, as higher numbers may require us to use a directory tree to avoid cluttering the pristine folder beyond hope; and yes, some file systems do perform rather poorly on directories with a very large number of files).
Of course we can make the cutoff larger, but then working with a single pack store becomes more time consuming (due to compression, fragmentation, etc.), which will cost either speed or space. It's better to keep the pack files small and numerous, then grow them as necessary, than to use large pack files that split at, say, a 2GB limit. That's because most WCs are actually smaller than 2GB, so splitting at that size wouldn't benefit the majority.
I like the 1PB minimum. But see above on the "limitation issue".
This is an archived mail posted to the Subversion Dev mailing list.