Re: FSFS successor ID design draft

From: Stefan Sperling <stsp_at_elego.de>
Date: Mon, 29 Aug 2011 18:01:38 +0200

On Sat, Aug 27, 2011 at 02:15:31AM +0300, Daniel Shahaf wrote:
> Add it as sibling of 'structure' on the branch? That's its ultimate home.

Fair enough.

> > - cache can be regenerated on demand
>
> Regenerated *offline*, right? ie, if the cache is lost it can be
> derived from the revision files (which remain authoritative), but this
> process need not withstand concurrent readers. (The problem being the
> need to decrement the cached-rev file)

Well, decrementing the cached-rev file would have to be done offline.
But the cache can be rebuilt while the repository is active.

We could have a flag that says whether the cache is active.
E.g. after an upgrade from an older repository format, the cache
would be flagged inactive, until it has been populated.

When rebuilding the cache we'd take the repository offline,
mark the cache inactive, remove the cached-rev file, go online,
and rebuild the cache. Readers will get information for older
revisions only until the cache has been rebuilt.

> > More precisely, the amount of successors listed in the cache is
> > equal to the number of node revisions in the repository, minus
> > the number of nodes in HEAD (which don't presently have successors)
> > and minus the number of nodes which were deleted in history (they
> > don't have successors either).
> >
>
> Do you mean "nodes" or "noderevs"?

This should say "node-revs in HEAD".

> Why are you subtracting the number of deletes?

Oh. Well, that part is wrong and should be removed.

> > The index size and offsets are 64 bit values.
>
> Stored as? ASCII decimal? (If so: how is it delimited or padded?) Big
> endian binary?

Some binary format. Fixed size.

Yes.

> Are the offset relative to the file or to the end of the index?

Haven't made up my mind about yet.

For this first draft, consider all offsets relative to the start
of the cache file. However, we can still change this.

> > | ... |
> > +----------------------------------------------------------------------+
> > | zero padding to multiple of s-block size |
> > +----------------------------------------------------------------------+
>
>
> You pad [index size, array_of[noderevid offset offset]] rather than
> just array_of[noderevid offset offset]. Correct?

Yes, the entire index fits into a multiple of the block size.

> What is the size of an s-block?

A constant. See later in the text.

> > +----------------------------------------------------------------------+
> > | offset of next s-block |
>
> Perhaps store the length of this s-block? That seems more robust to me.

The s-block length is constant.

> > +----------------------------------------------------------------------+
> > | successor_id, successor_id, ... |
>
> What is the separator between successor_id's?
>
> (Perhaps it's 0x2C 0x20, but that's not quite clear from the text.)

Some value, whatever. E.g. '\0'.

> > 8192 bytes will provide enough space for at least 128 successors of a
> > node-rev-id. Which means we'll have to allocate a new s-block about every
> > 128 successors
>
> 128 is the wrong figure, you overlooked the two offsets at the top of
> the s-block.

Yes, it's just an approximation.

> > (usually a slightly higher amount of successors should fit).
>
> Indeed, assuming that the successor_id's list has variable-sized
> entries. (ie, that the sucessors' noderev id's are not padded.)

Yes, they are variable size.

> > If more successors must fit SBLOCK_SIZE should be some multiple of 8192.
> >
>
> Multiple of 2**13? Just plain old 'power of 2' isn't good enough?

Not if we want the blocks to align with disk blocks.

> [ And how does the higher-level API deal with a situation where the
> cache is a far cry behind HEAD? It clearly shouldn't do a full history
> crawl by default, and there ought to be numerous ways to achieve that. ]

This is tricky. We need some heuristic.

Though Stefan^2 points out that rebuilding the cache might not be as
expensive as we thought it would be.

> Oh, so the s-blocks for a given noderev might not all be contiguous?

Exactly.

> Q: How do you update the 'offset of free space' in a manner that is
> consistent to readers?
>
> ie, if the offset was ASCII ' 800' and the new offset is '1600', you
> must avoid a reader accidentally seeing '1800'.

The offsets appear at the start of a disk block.
They should be a fixed-sized binary value -- ASCII has problems
as you've pointed out.

In any case, I am assuming that flushing data to disk flushes one disk
block. I.e. we can update one block atomically during flush.
Maybe this assumption isn't valid.

> (I seem to remember some ideas exchanged on the revprop-packing design
> discussion --- including one that involved having two offsets written
> down in the file, and one of them *always* being valid (but not always
> the same one) --- perhaps Peter will remember...)

Do you have a link to the archives?

> So, invariant: given a node-rev, all its s-blocks, except perhaps the
> last (in order of appearance in the file), have at most <size of the
> longest possible node-rev-id, minus one> free (zero) bytes at the end.

Did you mean "at least"?

> > It updates the 'next s-block' offset in the previous s-block to the
> > offset of the new s-block, and flushes the cache file to disk.
> >
>
> At what point does the writer update the cached-rev file?

The writer doesn't update the cached-rev file. This is done by
the global cache sync code.

I forgot to include two steps for the writer, though.
It needs to update the offset to the last s-block in the index

> So, a lost block is in the file. You could avoid that by seeking, not
> to the end of the file but to
>
> SELECT (MAX(offset to my last s-block) + SIZEOF(s-block)) FROM index
>
> (and, yes, I realize that this is a crash situation. two answers to
> that: first, it's easy to avoid having an unused block; second, if
> a file grows an unused block then a DOS can translate to running out of
> disk space.)

I don't really understand you here.
Are you suggesting that the code should check for a dead s-block
first, before appending to the file?

That MAX doesn't make sense. Each node_rev_id only appears once
in the index. Maybe that wasn't clear yet? There is exacrty one entry
for each node-rev-id from revision N in the cache file for revision N.

> > for node_rev in $rev {
> > obtain predecessor p_node_rev of $node_rev
> > obtain revision p_rev from $p_node_rev
>
> Should the cache note which of the successors is the M-successor (the
> one with the same copy-id)? Perhaps by stipulating that the M-successor
> is always the first in the list of successors, and leaving room for it
> if a C-successor (a copy) is created before the M-successor.

We don't know how long the M-successor ID is going to be.

If we want to optimise locating the M-successor, we could add the
offset to the s-block which contains the M-successor to the index,
and use a special flag byte to terminate the M-successor string.

> (It does mean we have two codepaths though, eg both packed and
> non-packed shards; and <cue wild idea> perhaps it'll be easier to have
> *only* packed shards, and to use the expensive replay method to answer
> queries about revisions younger than the latest pack file? (We may want
> to use a smaller shard size in this case.)
>
> And, by the way, this idea isn't /so/ unwieldy. I think that you can
> use *only* pack files --- and also use pack files for unfinished shards
> --- if, when you run out of INDEX_SIZE, you rewrite the whole file and
> move-into-place a new pack file (with a doubled INDEX_SIZE) on top of
> the old pack file. The advantage here would be not having the
> both-packed-and-nonpacked-shards headache to deal with.)

Might be possible. Not sure if that's worth it.

> > write $rev into 'cached-rev' file
> > flush 'cached-rev' file
>
> No, you don't flush, you move-into-place.

I decided to avoid move-into-place on purpose.

This is my biggest question:

Does FSFS presently move stuff into place if the move target location
already exists? I am asking because I don't think this would work on
Windows. We cannot overwrite a file if a reader has opened it.
Or do we really use the delete-retry-loop on Windows servers?

The only situation I can think of is where a previous commit created
a revision file and didn't update 'current'. In which case it's unlikely
that the revision file will be open when the next commits moves a file
with the same name into place.

> Okay, so the cache reading procedure MUST change, to account for the
> non-packed cache file vanishing (by a concurrent packer) and the pack
> file appearing instead. This is another layer of headache on top of the
> base one :-(

Do we support packing while the repository is live?
I haden't considered this.

Thanks for your feedback!
Received on 2011-08-29 18:02:27 CEST

This message: [ Message body ]
Next message: Danil Shopyrin: "Re: FSFS successor ID design draft"
Previous message: C. Michael Pilato: "Re: Recurse into same-repos externals at commit time."
In reply to: Daniel Shahaf: "Re: FSFS successor ID design draft"
Next in thread: Daniel Shahaf: "Re: FSFS successor ID design draft"
Reply: Daniel Shahaf: "Re: FSFS successor ID design draft"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]