Re: FSFS successor ID design draft

From: Daniel Shahaf <danielsh_at_elego.de>
Date: Sat, 27 Aug 2011 02:15:31 +0300

[ Added after reading through and replying to everything: overall this
seems good in a detail-by-detail reading[1]; please commit. ]

[1] the 'mile-high reading' I didn't do, but I think the last two days
of IRC discussions are a fine replacement to it.

Stefan Sperling wrote on Fri, Aug 26, 2011 at 13:14:32 +0200:
> Below is an initial draft of a design for successor-IDs in FSFS.
> It is a bit long, but I hope it covers all relevant points in
> sufficient detail.
>
> Please share your thoughts on this. I have most likely missed a few things.
>
> If we like it I will put it into the repository for further editing,
> either in notes/ on trunk or onto the fsfs-successor ID branch.
>

Add it as sibling of 'structure' on the branch? That's its ultimate home.

> If this design has some severe problem I don't mind going back to
> the drawing board. This is my first shot, afterall :)
>
> cmpilato, does this significantly differ from what you've done for BDB?
> Will both backends have the same (or similar) behaviour if we use
> this design for FSFS?
>
> Thanks to neels and danielsh for all your input so far.
> You've been tremendously helpful!
>
> I would also be happy to see alternative proposals.
>
>
>
> FSFS successor IDs
> ==================
>
> See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
> for what this is all about.
>
> What follows are ideas about implementing successor ID support for FSFS.
>
> Useful background information:
> Node-revisions: subversion/libsvn_fs_base/notes/fs-history
> subversion/libsvn_fs_base/notes/structure
> fsfs design: subversion/libsvn_fs_fs/structure
>
>
> The "cache" and its properties
> ==============================
>
> Storage of successor IDs is essentially a cache of the result of the
> inversion of the predecessor relation between all node-revisions.
> This inversion is a *very* expensive operation (follow every node
> that ever existed in the repository from its most recent revision
> back to the revision it was created in).
>
> In the following, "cache" refers to the store for successor IDs.
>
> After a general introduction, a proposed cache design is presented.
> Note that the cache implementation will be hidden behind the API of
> libsvn_fs_fs and can be improved or changed in every new minor
> release of Subversion.
>
> Required properties of successor ID cache for FSFS:
> - FSFS-compatible:
> - cache must use some plain file format
> - writers do not block readers (i.e. sqlite is out of the question)
> - cache must not rely on modifying existing revision files
> - cache can be regenerated on demand

Regenerated *offline*, right? ie, if the cache is lost it can be
derived from the revision files (which remain authoritative), but this
process need not withstand concurrent readers. (The problem being the
need to decrement the cached-rev file)

Another point:
- cache readers know how complete the data they read is

> - do not corrupt the cache in case of an unexpected crash
> - try to keep number of directory entries low
> - conserve inodes by design (preferred), or via packing support
>
> Nice-to-have properties:
> - the repository can be live during cache generation

   - cache can be constructed incrementally
     (ie, amortize the construction among many readers/writers, rather
     than require a full history crawl to be done all at once)
>
>
> Storage size requirements
> =========================
>
> Ballpark figures about node-revs from #svn-dev:
> * cmpilato does some math. There are over 350,000 node revisions in my
> ancient copy of the svn.collab.net Subversion repos.
> <hwright> there are over 20M of them in the ASF repo
>
> How many successors does a given node-rev have?
>
> There is at most one successor on the same line of history (same copy_id).
> Each copy operation also creates one new successor.
>
> How big will the cache become, in total?
>
> The successor cache does not store entire node-revs, just node-rev IDs.
> These are generally not longer than 40 bytes (less than half a line
> in a 80x25 terminal being the rough estimate).
>
> The amount of raw information to store per node-revision is the
> size of this mapping:
> node_rev_id => successor_id
>
> Each entry is at least 80 bytes large, 40 bytes for the node_rev_id
> and 40 bytes for the successor ID.
>
> Each new node-revision adds one entry to the map.
> Since each successor is itself a node-revision, the entire cache
> has at most as many entries as the amount of node-revs in the
> repository.
>

(Technically you can say "strictly less" entries here, since r0 always
contains a noderev that is not the successor of anything.)

> More precisely, the amount of successors listed in the cache is
> equal to the number of node revisions in the repository, minus
> the number of nodes in HEAD (which don't presently have successors)
> and minus the number of nodes which were deleted in history (they
> don't have successors either).
>

Do you mean "nodes" or "noderevs"?

Why are you subtracting the number of deletes?

> In other words, the amount of successors in the cache is equal to
> the amount of pred: entries in revision files.
>
> Let's try to plug in some numbers from the above svn.collab.net case.
>
> There are 350000 node revs.
> maximum size of successor IDs in cache = 40 bytes * 350000
> maximum size of node-rev IDs in cache = 40 bytes * 350000
> maximum size of cache = 40 bytes * 350000 * 2
>
> This amounts to roughly 28MB worst-case successor cache size for
> the svn.collab.net repository (which itself is about 500MB in size).
>
>
> Implementation proposal
> =======================
>
>
> - On-disk File Format -
>
> All files related to the cache are stored in the
> repos/db/successor-id-cache directory ("cache directory").
>
> There is one file which contains the number of the revision the
> cache was last synced to: repos/db/successor-id-cache/cached-rev
> This is used by both readers and writers.
>
> The directory also contains files named after revisions,
> and is sharded modulo some value TBD:
>
> successor-id-cache/0/1
> successor-id-cache/0/2
> successor-id-cache/0/3
> ...
> successor-id-cache/1/N
> successor-id-cache/1/M
> ...
>
> These individual files store successors of all node-revs of a given revision
> (or, optionally, a number of revisions -- see notes on packing below).
>
> Each file starts with an index, followed by zero or more successors-blocks
> ("s-block" for short).
>
> The format of the index is shown below. It contains the index size
> (which is a fixed value) and one entry for each node_rev that appears
> in the revision(s) the file is responsible for.
>
> The index size and offsets are 64 bit values.

Stored as? ASCII decimal? (If so: how is it delimited or padded?) Big
endian binary?

> (32 bit index size would probably be sufficient, but let's not count peanuts.)
> node_rev_id is a variable length ASCII string terminated by '\0'.
>

(That's consistent with the use of noderev id's in noderev headers in
the revision files; a header line is delimited by \n.)

Just to clarify: both offsets are to the *first* byte of the s-block,
correct?

Are the offset relative to the file or to the end of the index?

> | ... |
> +----------------------------------------------------------------------+
> | zero padding to multiple of s-block size |
> +----------------------------------------------------------------------+

You pad [index size, array_of[noderevid offset offset]] rather than
just array_of[noderevid offset offset]. Correct?

(ie, sizeof(index size) + sizeof(array) + sizeof(padding) is what has to
be a power of 2, rather than only the sum of the last two addends.)

What is the size of an s-block?

>
> If a node-rev does not have any successors, both s-block offsets are zero.
> This makes finding the answer to the question "does this node have any
> successors?" an O(1) operation (only one disk block needs to be read

[[colon] the index block]

> ).
>
> If none of the node-revs have successors yet, the file ends here.
>
> If a node-rev has successors, the file contains one or more s-blocks.

Corollary: the cache is updated after 'current' has been bumped,
finishing a commit.

> An s-block lists

[the noderev id's of ]

> successors of the node-rev and provides some amount of
> pre-allocated space for future successors. The offset to the first s-block
> helps readers, the offset to the last s-block helps writers (the next
> section provides details).
>
> The padding at the end of the index ensures that all s-block boundaries
> lie on disk sector boundaries (more details later).
>
>
> The format of an s-block is shown below. It has a fixed size (SBLOCK_SIZE).
> The offsets are 64 bit values. successor_id is a variable length ASCII
> string terminated by '\0'.
>

(same serialization questions as above)

> +----------------------------------------------------------------------+
> | offset of next s-block |

Perhaps store the length of this s-block? That seems more robust to me.

> +----------------------------------------------------------------------+
> | offset of free space in this s-block |

Offset relative to what?

(Please always say what offsets are relative to... that's as needed as
giving the units of measured quantities.)

> +----------------------------------------------------------------------+
> | successor_id, successor_id, ... |

What is the separator between successor_id's?

(Perhaps it's 0x2C 0x20, but that's not quite clear from the text.)

> +----------------------------------------------------------------------+
> | free space zero-padded to SBLOCK_SIZE |
> +----------------------------------------------------------------------+
>
> What is a good value for SBLOCK_SIZE?
> Modern disks have a block size of 8192 bytes (some disks even use this
> block size internally but report 512 bytes of block size to the OS).
> Using this block size means that each s-block will fit into one sector
> on modern disks, and into 16 sectors on older disks using 512 byte sectors.
> In both cases an s-block can be read in a single sequential read.
>
> Recall that a successor_id will take about 40 bytes. Let's round that
> up to 64 for good measure. Applying advanced mathematics tells us that

(including any commas; per the above question)

> 8192 bytes will provide enough space for at least 128 successors of a
> node-rev-id. Which means we'll have to allocate a new s-block about every
> 128 successors

128 is the wrong figure, you overlooked the two offsets at the top of
the s-block.

> (usually a slightly higher amount of successors should fit).

Indeed, assuming that the successor_id's list has variable-sized
entries. (ie, that the sucessors' noderev id's are not padded.)

> If more successors must fit SBLOCK_SIZE should be some multiple of 8192.
>

Multiple of 2**13? Just plain old 'power of 2' isn't good enough?

> If this s-block is the last s-block for this node-rev, the offset of
> the next s-block is zero.
>
> The offset of free space within the s-block is used by readers when
> reading existing successors and by writers when appending new successors.
> (Note that in the implementation the offset of free space might also
> be a 16 bit number relative to the start of the s-block, instead of
> being a 64 bit number. But let's ignore such optimisations for now.)
>
>
>
> - Procedure for Readers -
>
> A reader first reads the cached_rev file and remembers the revision number.
>
> It then locates the cache file corresponding to the given node_rev_id.
> The name of the file is keyed on the revision of the node-rev. (Without
> packing the file has the same name as the revision. With packing the
> file is named after this revision modulo some fixed and known value.)
>

Point being: filename = deterministic function(revnum)

> It opens this file and reads the index to learn the offset of the
> first s-block of the node_rev_id.
>

[ I point out later, with packing, things aren't that simple, as the
function isn't deterministic; it's two-valued. ]

> If the offset is zero, the node-rev has no successors, and the search ends.
>
> The reader seeks to the offset of the first s-block.
>
> It reads the offset of free space within the s-block, and then reads
> successors from the s-block up to this offset.
>
> If the offset to the next s-block is zero, the search ends.
> If the offset to the next s-block is non-zero, the reader
> proceeds to the next s-block.
>

I suggested earlier that "next s-block" offsets be relative. But if
they're absolute, I'd suggest here to check that they are later in the
file (to protect against infinite loops).

> When the search ends, the reader returns the list of successors
> and the number of the cached_rev, which indicates up to which
> revision the successor list is valid. (This is a lower bound; the
> successor list might already contain some values for newer revisions
> if a writer is concurrently updating the cache. However, there is no
> harm done if the reader uses this information because it pertains to
> already committed revisions.)
>

Repeating for the list what I mentioned on IRC:

I envision that, in addition to this low level API that returns the
results up to some revision which it also returns, we'll have a wrapper
API (in fs-loader.c?) that, given a revnum rN, (a) calls the low-level
API, obtaining the successors up to some rM; and (b) uses other APIs ---
svn_fs_replay()? --- to compute (expensively) the successors from rM+1
to rN.

[ And how does the higher-level API deal with a situation where the
cache is a far cry behind HEAD? It clearly shouldn't do a full history
crawl by default, and there ought to be numerous ways to achieve that. ]

> Note that there are only O(number of s-blocks) disk seeks involved.
> Any time data is read from disk one or more disk blocks are read sequentially.
>

Oh, so the s-blocks for a given noderev might not all be contiguous?
Okay.

>
>
> - Procedure for Writers -
>
> Note that the following describes how successor data for a single
> node_rev_id is updated. This is only a sub-step of the global cache
> synchronisation procedure (which is described in the next section).
>
> A writer operates under the assumption that the entire cache directory
> has been locked for writing.
>
> The writer locates the cache file corresponding to the given node_rev_id.
>
> The writer opens this file and reads the index to learn the offset of the
> last s-block of the node_rev_id.
>
> The writer seeks to the offset of the last s-block.
>
> The writer reads the offset of remaining free space within the s-block,
> and calculates the amount of free space left within the s-block (recall
> that the s-block has fixed size SBLOCK_SIZE).
>
> The writer writes successor IDs to free space within the s-block,

Q: How do you perform the writes in a manner safe to concurrent readers?

A: using the 'offset of free space', below.

Q: How do you update the 'offset of free space' in a manner that is
consistent to readers?

ie, if the offset was ASCII ' 800' and the new offset is '1600', you
must avoid a reader accidentally seeing '1800'.

(I seem to remember some ideas exchanged on the revprop-packing design
discussion --- including one that involved having two offsets written
down in the file, and one of them *always* being valid (but not always
the same one) --- perhaps Peter will remember...)

> and flushes the cache file to disk.
>
> The writer updates the offset of free space in the s-block, and
> flushes the cache file to disk.
>
> (At this point readers get to see the new successor IDs.)
>
>
> At any point, if remaining free space is too small for all new successors,
> the writer must allocate a new s-block.
>

So, invariant: given a node-rev, all its s-blocks, except perhaps the
last (in order of appearance in the file), have at most <size of the
longest possible node-rev-id, minus one> free (zero) bytes at the end.

> To do so, it seeks to the very end of the file and appends a new s-block
> which initially contains only zeros.
>
> The writer writes successors to free space in the new s-block.
> It updates the offset to free space in the new s-block, and flushes
> the cache file to disk.
>

(If the new s-block runs out, loop and allocate yet another new
s-block.)

> It updates the 'next s-block' offset in the previous s-block to the
> offset of the new s-block, and flushes the cache file to disk.
>

At what point does the writer update the cached-rev file?

>
> If the writer crashes, the following cases can happen:
>
> - New successors have been added to an s-block, but the offset to
> free space has not been updated.
> => Newly added successors will be overwritten by the next writer,
> and readers don't see them.
> - A new s-block has been written, but the 'next s-block' offset at
> the previous s-block has not been updated.
> => An unused s-block will remain in the file.
> The next writer adding successors of the same node-rev will
> append a new s-block to the file. Readers don't see the unused s-block.
> If the cache is rebuilt from scratch the unused s-block disappears.
>

So, a lost block is in the file. You could avoid that by seeking, not
to the end of the file but to

SELECT (MAX(offset to my last s-block) + SIZEOF(s-block)) FROM index

(and, yes, I realize that this is a crash situation. two answers to
that: first, it's easy to avoid having an unused block; second, if
a file grows an unused block then a DOS can translate to running out of
disk space.)

> Thus, cache consistency is guaranteed even in case of crashes.
>
>
>
> - Cache Synchronisation -
>
> The procedure for synchronising the cache is as follows.
> Note that it assumes a write lock on the cache is held.
>
> This procedure is run during post-commit processing after 'current'
> has been updated. It can also be run manually via svnadmin.
>

(it should be run manually if the precedure failed to run during
post-commit FS processing for any reason)

> sync_successor_id_cache() {
> open 'cached-rev' file
> read $cached_rev from 'cached-rev' file
>
> open 'current' file
> read $new_cached_rev from 'current' file
> close 'current' file
>

/me notes the order the two files are read

> # The new_successors hash table is keyed by a node_rev_id.
> # Its values are lists of successor IDs.
> #
> # The on-disk cache file format is designed for batch updates.
> # It is inefficient to add new successors to cache files one by one.
> # To update cache files efficiently, this table will grow up to a certain
> # upper bound and be flushed to disk periodically.
> new_successors = {}
>
> for rev in [$cached_rev, ..., $new_cached_rev] {
>

for rev in [$cached_rev+1, ..., $new_cached_rev] {

> for node_rev in $rev {
> obtain predecessor p_node_rev of $node_rev
> obtain revision p_rev from $p_node_rev

Should the cache note which of the successors is the M-successor (the
one with the same copy-id)? Perhaps by stipulating that the M-successor
is always the first in the list of successors, and leaving room for it
if a C-successor (a copy) is created before the M-successor.

> if cache file $p_rev exists {
>
> # this uses the 'reader' procedure as explained above
> look up $p_node_rev:$node_rev in cache file
>
> if $p_node_rev:$node_rev is already listed
> continue with next node_rev
> else
> new_successors[$p_node_rev] += $node_rev
> } else {
> new_successors[$p_node_rev] += $node_rev
> }
> }
>
> if space required by new_successors reached upper bound {

or if we are on the last iteration on the loop
(to avoid the code duplication later)

> for node_rev in new_successors.keys() {
> # update_cache_file() implements the 'writer' procedure explained above
> update_cache_file($node_rev, new_successors[$node_rev])
> }

So, you scan from (cached_rev + 1) to (youngest) and group the writes by
the predecessor's revision. (Or perhaps by the predecessor's revision shard, if
you're to write a pack file directly?) Okay. The tradeoff is the next
line --- that cached-rev can only be updated after all those revision's
new noderevs have been added to the cache, without checkpoints in the
middle of that range.

And the 'upper bound' addresses the case when (youngest - cached_rev) is
large. Okay.

(It does mean we have two codepaths though, eg both packed and
non-packed shards; and <cue wild idea> perhaps it'll be easier to have
*only* packed shards, and to use the expensive replay method to answer
queries about revisions younger than the latest pack file? (We may want
to use a smaller shard size in this case.)

And, by the way, this idea isn't /so/ unwieldy. I think that you can
use *only* pack files --- and also use pack files for unfinished shards
--- if, when you run out of INDEX_SIZE, you rewrite the whole file and
move-into-place a new pack file (with a doubled INDEX_SIZE) on top of
the old pack file. The advantage here would be not having the
both-packed-and-nonpacked-shards headache to deal with.)

> write $rev into 'cached-rev' file
> flush 'cached-rev' file

No, you don't flush, you move-into-place.

> new_successors = {}
> }
> }
>
> if new_successors != {} {
> for node_rev in new_successors.keys() {
> # update_cache_file() implements the 'writer' procedure explained above
> update_cache_file($node_rev, new_successors[$node_rev])
> }
> write $new_cached_rev into 'cached-rev' file
> }
>
> close 'cached-rev' file
> }
>
>
> If this code crashes, the following can happen:
> - The cache contains successors computed for revisions beyond the
> revision listed in the 'cached-rev' file.
> => The next writer will not add these successors again. No harm done.

Or perhaps one of the successors was partially added, in which case we
fall back to the "the 'free space' offset hadn't been updated" point
from above. (and the bytes would be overwritten by identical bytes,
most likely)

I think this is fine, but I'd like to read it again to be surer.

>
>
> - Packing -
>
> The cache can optionally be packed. This means that successors of
> node_revs from more than one revision live in the same cache file.
>

Okay. I missed your terminology at first, but I see now that you mean
'sharding' and 'packing' in exactly the same sense they are used today
for revision files.

> The on-disk format and cache synchronisation procedure does not change.
> The only thing that changes is the procedure to look up the name of the
> cache file responsible for a given node-rev. The cache file is named
> after the node-rev's revision modulo some fixed and known value.
>

This is different from how revision shard files are named: $FS/revs/8/8810

> To create the index of a packed cache file the combined number of
> node-revs in all revisions that share the same cache file must be known.
> Thus, a packed cache file can only be created after all revisions
> which share the cache file have been committed.
>
> Packing can be performed with 'svnadmin pack'.
>

Okay, so the cache reading procedure MUST change, to account for the
non-packed cache file vanishing (by a concurrent packer) and the pack
file appearing instead. This is another layer of headache on top of the
base one :-(

[ but see above for an idea that avoids this headache by using pack
files for unfinished shards too ]

>
> - How does this design live up to our requirements? -
>
> - FSFS-compatible:
> - cache must use some plain file format
> => yes
> - writers do not block readers
> => yes, a reader will get consistent data even while the cache
> is being written to
> - cache must not rely on modifying existing revision files
> => yes
> - cache can be regenerated on demand
> => yes

- cache readers know how complete the data they read is
=> yes, via the returned their value of cached-rev

> - do not corrupt the cache in case of an unexpected crash
> => the cache can only contain more information than claimed by
> the 'cached-rev' file
> - try to keep number of directory entries low
> => yes, via sharding
> - conserve inodes by design (preferred), or via packing support
> => yes, via optional packing
>
> - the repository can be live during cache generation
> => yes, but readers will not see all successors until the cache
> is fully populated

   - cache can be constructed incrementally
     (ie, amortize the construction among many readers/writers, rather
     than require a full history crawl to be done all at once)

Well, it /can/ be constructed with the smallest increment being "one
revision of successors", and that's what the post-commit FS processing
will do. The same procedure when run manually is optimized for large
revision ranges; that is a useful property for the initial construction
of a cache.
Received on 2011-08-27 01:16:22 CEST

This message: [ Message body ]
Next message: Daniel Shahaf: "Re: FSFS successor ID design draft"
Previous message: Philip Martin: "Re: 1.7.0-rc2 up for signing / testing"
In reply to: Stefan Sperling: "FSFS successor ID design draft"
Next in thread: Stefan Sperling: "Re: FSFS successor ID design draft"
Reply: Stefan Sperling: "Re: FSFS successor ID design draft"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]