Re: FSFS successor ID design draft

From: Daniel Shahaf <danielsh_at_elego.de>
Date: Tue, 30 Aug 2011 19:17:47 +0300

Stefan Sperling wrote on Mon, Aug 29, 2011 at 18:01:38 +0200:
> On Sat, Aug 27, 2011 at 02:15:31AM +0300, Daniel Shahaf wrote:
> > > - cache can be regenerated on demand
> >
> > Regenerated *offline*, right? ie, if the cache is lost it can be
> > derived from the revision files (which remain authoritative), but this
> > process need not withstand concurrent readers. (The problem being the
> > need to decrement the cached-rev file)
>
> Well, decrementing the cached-rev file would have to be done offline.
> But the cache can be rebuilt while the repository is active.
>
> We could have a flag that says whether the cache is active.
> E.g. after an upgrade from an older repository format, the cache
> would be flagged inactive, until it has been populated.
>
> When rebuilding the cache we'd take the repository offline,
> mark the cache inactive, remove the cached-rev file, go online,
> and rebuild the cache. Readers will get information for older
> revisions only until the cache has been rebuilt.
>

We mentioned a format number for the cache files; this "active flag"
almost smells like a format number for the entire cache... (with
format 0 being "inactive")

> > > The index size and offsets are 64 bit values.
> >
> > Stored as? ASCII decimal? (If so: how is it delimited or padded?) Big
> > endian binary?
>
> Some binary format. Fixed size.
>

Fair enough.

> > Q: How do you update the 'offset of free space' in a manner that is
> > consistent to readers?
> >
> > ie, if the offset was ASCII ' 800' and the new offset is '1600', you
> > must avoid a reader accidentally seeing '1800'.
>
> The offsets appear at the start of a disk block.
> They should be a fixed-sized binary value -- ASCII has problems
> as you've pointed out.
>
> In any case, I am assuming that flushing data to disk flushes one disk
> block. I.e. we can update one block atomically during flush.
> Maybe this assumption isn't valid.
>

ack

> > (I seem to remember some ideas exchanged on the revprop-packing design
> > discussion --- including one that involved having two offsets written
> > down in the file, and one of them *always* being valid (but not always
> > the same one) --- perhaps Peter will remember...)
>
> Do you have a link to the archives?
>

Not exactly, sorry. Most discussions happened in this (huge) thread:

    From: kmradke_at_rockwellcollins.com
    To: dev_at_subversion.apache.org
    Subject: Does fsfs revprop packing no longer allow usage of traditional backup
            software?
    Message-ID: <OFF5DC1071.793EFD7C-ON862578BF.005304DA-862578BF.00552DB1_at_rockwellcollins.com>

The crux of the idea, though, is to have two byte locations for the
offset --- say, store an integer at byte 4-7 and another at bytes 8-11
--- and have some logic to choose which one of those to read and treat
as the offset. In the meantime the other integer can be updated
arbitrarily.

> > So, invariant: given a node-rev, all its s-blocks, except perhaps the
> > last (in order of appearance in the file), have at most <size of the
> > longest possible node-rev-id, minus one> free (zero) bytes at the end.
>
> Did you mean "at least"?
>

No. I said that "All completed s-blocks have at most O(1) free bytes
at the end".

> > > It updates the 'next s-block' offset in the previous s-block to the
> > > offset of the new s-block, and flushes the cache file to disk.
> > >
> >
> > At what point does the writer update the cached-rev file?
>
> The writer doesn't update the cached-rev file. This is done by
> the global cache sync code.
>
> I forgot to include two steps for the writer, though.
> It needs to update the offset to the last s-block in the index
>

You forgot to include "two steps", but you only list one step here. Did
you forgot to list the other step which you had originally forgotten to
list?

> > So, a lost block is in the file. You could avoid that by seeking, not
> > to the end of the file but to
> >
> > SELECT (MAX(offset to my last s-block) + SIZEOF(s-block)) FROM index
> >
> > (and, yes, I realize that this is a crash situation. two answers to
> > that: first, it's easy to avoid having an unused block; second, if
> > a file grows an unused block then a DOS can translate to running out of
> > disk space.)
>
> I don't really understand you here.
> Are you suggesting that the code should check for a dead s-block
> first, before appending to the file?
>

I'd like to avoid growing 'holes' in the middle of the file that readers
will never access. (I don't want to require dump|load in order to lose
such holes.)

> That MAX doesn't make sense. Each node_rev_id only appears once
> in the index. Maybe that wasn't clear yet? There is exacrty one entry
> for each node-rev-id from revision N in the cache file for revision N.
>

Clear.

> > > for node_rev in $rev {
> > > obtain predecessor p_node_rev of $node_rev
> > > obtain revision p_rev from $p_node_rev
> >
> > Should the cache note which of the successors is the M-successor (the
> > one with the same copy-id)? Perhaps by stipulating that the M-successor
> > is always the first in the list of successors, and leaving room for it
> > if a C-successor (a copy) is created before the M-successor.
>
> We don't know how long the M-successor ID is going to be.
>

Actually we can bound it, but the bound I have in mind is going to be
a large overestimate for the common case.

> If we want to optimise locating the M-successor, we could add the
> offset to the s-block which contains the M-successor to the index,
> and use a special flag byte to terminate the M-successor string.
>

This is probably going to be more space-efficient, yes.

> > (It does mean we have two codepaths though, eg both packed and
> > non-packed shards; and <cue wild idea> perhaps it'll be easier to have
> > *only* packed shards, and to use the expensive replay method to answer
> > queries about revisions younger than the latest pack file? (We may want
> > to use a smaller shard size in this case.)
> >
> > And, by the way, this idea isn't /so/ unwieldy. I think that you can
> > use *only* pack files --- and also use pack files for unfinished shards
> > --- if, when you run out of INDEX_SIZE, you rewrite the whole file and
> > move-into-place a new pack file (with a doubled INDEX_SIZE) on top of
> > the old pack file. The advantage here would be not having the
> > both-packed-and-nonpacked-shards headache to deal with.)
>
> Might be possible. Not sure if that's worth it.
>
> > > write $rev into 'cached-rev' file
> > > flush 'cached-rev' file
> >
> > No, you don't flush, you move-into-place.
>
> I decided to avoid move-into-place on purpose.
>
> This is my biggest question:
>
> Does FSFS presently move stuff into place if the move target location
> already exists? I am asking because I don't think this would work on
> Windows. We cannot overwrite a file if a reader has opened it.
> Or do we really use the delete-retry-loop on Windows servers?
>
> The only situation I can think of is where a previous commit created
> a revision file and didn't update 'current'. In which case it's unlikely
> that the revision file will be open when the next commits moves a file
> with the same name into place.
>

As I said on IRC: look into revprop editing in 1.6 and 1.7 FSFS.

(and probably the locks/ hierarchy, too; see fsfs/lock.c)

> > Okay, so the cache reading procedure MUST change, to account for the
> > non-packed cache file vanishing (by a concurrent packer) and the pack
> > file appearing instead. This is another layer of headache on top of the
> > base one :-(
>
> Do we support packing while the repository is live?

Yes.

> I haden't considered this.
>
> Thanks for your feedback!

Welcome.
Received on 2011-08-30 18:18:38 CEST

This message: [ Message body ]
Next message: Julian Foad: "Support for reverse merges?"
Previous message: Daniel Shahaf: "Re: [PATCH] Don't repeatedly notify the same warning"
In reply to: Stefan Sperling: "Re: FSFS successor ID design draft"
Next in thread: Daniel Shahaf: "Re: FSFS successor ID design draft"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]