
Re: FSFS successor ID design draft

From: Daniel Shahaf <danielsh_at_elego.de>
Date: Tue, 30 Aug 2011 19:17:47 +0300

Stefan Sperling wrote on Mon, Aug 29, 2011 at 18:01:38 +0200:
> On Sat, Aug 27, 2011 at 02:15:31AM +0300, Daniel Shahaf wrote:
> > > - cache can be regenerated on demand
> >
> > Regenerated *offline*, right? ie, if the cache is lost it can be
> > derived from the revision files (which remain authoritative), but this
> > process need not withstand concurrent readers. (The problem being the
> > need to decrement the cached-rev file)
> Well, decrementing the cached-rev file would have to be done offline.
> But the cache can be rebuilt while the repository is active.
> We could have a flag that says whether the cache is active.
> E.g. after an upgrade from an older repository format, the cache
> would be flagged inactive, until it has been populated.
> When rebuilding the cache we'd take the repository offline,
> mark the cache inactive, remove the cached-rev file, go online,
> and rebuild the cache. Readers will get information for older
> revisions only until the cache has been rebuilt.

We mentioned a format number for the cache files; this "active flag"
almost smells like a format number for the entire cache... (with
format 0 being "inactive")

> > > The index size and offsets are 64 bit values.
> >
> > Stored as? ASCII decimal? (If so: how is it delimited or padded?) Big
> > endian binary?
> Some binary format. Fixed size.

Fair enough.
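(For concreteness, a fixed-size big-endian encoding could look like the sketch below. This is only an illustration of "some binary format, fixed size" -- the helper names are made up, not actual FSFS code.)

```c
#include <stdint.h>

/* Encode/decode a 64-bit index offset as a fixed-size big-endian
   value, so every offset occupies exactly 8 bytes and can be
   rewritten in place without shifting the rest of the file. */
static void
encode_offset(unsigned char buf[8], uint64_t offset)
{
  for (int i = 7; i >= 0; i--)
    {
      buf[i] = (unsigned char)(offset & 0xff);
      offset >>= 8;
    }
}

static uint64_t
decode_offset(const unsigned char buf[8])
{
  uint64_t offset = 0;
  for (int i = 0; i < 8; i++)
    offset = (offset << 8) | buf[i];
  return offset;
}
```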

> > Q: How do you update the 'offset of free space' in a manner that is
> > consistent to readers?
> >
> > ie, if the offset was ASCII ' 800' and the new offset is '1600', you
> > must avoid a reader accidentally seeing '1800'.
> The offsets appear at the start of a disk block.
> They should be a fixed-sized binary value -- ASCII has problems
> as you've pointed out.
> In any case, I am assuming that flushing data to disk flushes one disk
> block. I.e. we can update one block atomically during flush.
> Maybe this assumption isn't valid.


> > (I seem to remember some ideas exchanged on the revprop-packing design
> > discussion --- including one that involved having two offsets written
> > down in the file, and one of them *always* being valid (but not always
> > the same one) --- perhaps Peter will remember...)
> Do you have a link to the archives?

Not exactly, sorry. Most discussions happened in this (huge) thread:

    From: kmradke_at_rockwellcollins.com
    To: dev_at_subversion.apache.org
    Subject: Does fsfs revprop packing no longer allow usage of traditional backup
    Message-ID: <OFF5DC1071.793EFD7C-ON862578BF.005304DA-862578BF.00552DB1_at_rockwellcollins.com>

The crux of the idea, though, is to have two byte locations for the
offset --- say, store an integer at bytes 4-7 and another at bytes 8-11
--- and have some logic to choose which one of those to read and treat
as the offset. In the meantime, the other integer can be updated
without affecting readers.
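Roughly (the struct layout, names, and selector byte below are my own illustration of the two-location idea, not anything committed):

```c
#include <stdint.h>

/* Two copies of the free-space offset plus a one-byte selector.
   A writer updates the slot the selector does NOT point at, flushes
   that slot, then flips the selector with a single one-byte write.
   A reader reads the selector first and then the slot it names, so
   it can never observe a half-written offset. */
struct index_header
{
  unsigned char selector;     /* 0 or 1: which slot is currently valid */
  uint64_t offset_slot[2];    /* two copies of the free-space offset */
};

static uint64_t
read_free_space_offset(const struct index_header *hdr)
{
  return hdr->offset_slot[hdr->selector & 1];
}

static void
update_free_space_offset(struct index_header *hdr, uint64_t new_offset)
{
  unsigned char inactive = (unsigned char)(1 - (hdr->selector & 1));

  hdr->offset_slot[inactive] = new_offset;
  /* ... flush the inactive slot to disk here ... */
  hdr->selector = inactive;   /* one-byte write flips to the new slot */
  /* ... flush the selector byte ... */
}
```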

> > So, invariant: given a node-rev, all its s-blocks, except perhaps the
> > last (in order of appearance in the file), have at most <size of the
> > longest possible node-rev-id, minus one> free (zero) bytes at the end.
> Did you mean "at least"?

No. I said that "All completed s-blocks have at most O(1) free bytes
at the end".

> > > It updates the 'next s-block' offset in the previous s-block to the
> > > offset of the new s-block, and flushes the cache file to disk.
> > >
> >
> > At what point does the writer update the cached-rev file?
> The writer doesn't update the cached-rev file. This is done by
> the global cache sync code.
> I forgot to include two steps for the writer, though.
> It needs to update the offset to the last s-block in the index

You forgot to include "two steps", but you only list one step here. Did
you forget to list the other step which you had originally forgotten to
include?

> > So, a lost block is in the file. You could avoid that by seeking, not
> > to the end of the file but to
> >
> > SELECT (MAX(offset to my last s-block) + SIZEOF(s-block)) FROM index
> >
> > (and, yes, I realize that this is a crash situation. two answers to
> > that: first, it's easy to avoid having an unused block; second, if
> > a file grows an unused block then a DOS can translate to running out of
> > disk space.)
> I don't really understand you here.
> Are you suggesting that the code should check for a dead s-block
> first, before appending to the file?

I'd like to avoid growing 'holes' in the middle of the file that readers
will never access. (I don't want to require dump|load in order to lose
such holes.)
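In other words, pick the append position from the index itself rather than from the physical file size, so an s-block that was appended but never linked in (say, after a crash) gets overwritten instead of becoming a permanent hole. A sketch, with an assumed index-entry layout:

```c
#include <stdint.h>
#include <stddef.h>

#define SBLOCK_SIZE 4096  /* assumed fixed s-block size */

/* Hypothetical index entry: one per node-rev-id, recording the
   offset of that node-rev's last s-block in the cache file. */
struct index_entry
{
  uint64_t last_sblock_offset;
};

/* Return the offset at which the next s-block should be appended:
   one past the end of the highest s-block any index entry points
   at, rather than the end of the file.  A dead s-block past that
   point is simply reused. */
static uint64_t
next_sblock_offset(const struct index_entry *entries, size_t n)
{
  uint64_t max_end = 0;

  for (size_t i = 0; i < n; i++)
    {
      uint64_t end = entries[i].last_sblock_offset + SBLOCK_SIZE;
      if (end > max_end)
        max_end = end;
    }
  return max_end;
}
```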

> That MAX doesn't make sense. Each node_rev_id only appears once
> in the index. Maybe that wasn't clear yet? There is exactly one entry
> for each node-rev-id from revision N in the cache file for revision N.


> > > for node_rev in $rev {
> > > obtain predecessor p_node_rev of $node_rev
> > > obtain revision p_rev from $p_node_rev
> >
> > Should the cache note which of the successors is the M-successor (the
> > one with the same copy-id)? Perhaps by stipulating that the M-successor
> > is always the first in the list of successors, and leaving room for it
> > if a C-successor (a copy) is created before the M-successor.
> We don't know how long the M-successor ID is going to be.

Actually we can bound it, but the bound I have in mind is going to be
a large overestimate for the common case.

> If we want to optimise locating the M-successor, we could add the
> offset to the s-block which contains the M-successor to the index,
> and use a special flag byte to terminate the M-successor string.

This is probably going to be more space-efficient, yes.
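(To make the flag-byte idea concrete: if ordinary successor IDs in an s-block are NUL-separated and the M-successor alone is terminated by a sentinel byte, a reader can find it with a single scan. The sentinel value and layout below are assumptions for illustration only.)

```c
#include <stddef.h>

#define M_SUCCESSOR_FLAG 0x01  /* assumed sentinel terminating the M-successor */

/* Scan an s-block for the M-successor ID.  Ordinary successors are
   '\0'-terminated; the M-successor is terminated by the flag byte.
   Returns the ID's length and sets *id_out to its start, or returns
   0 if the block contains no M-successor. */
static size_t
find_m_successor(const unsigned char *block, size_t block_len,
                 const unsigned char **id_out)
{
  size_t start = 0;

  for (size_t i = 0; i < block_len; i++)
    {
      if (block[i] == M_SUCCESSOR_FLAG)
        {
          *id_out = block + start;
          return i - start;
        }
      if (block[i] == '\0')  /* ordinary successor: skip past it */
        start = i + 1;
    }
  return 0;
}
```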

> > (It does mean we have two codepaths though, eg both packed and
> > non-packed shards; and <cue wild idea> perhaps it'll be easier to have
> > *only* packed shards, and to use the expensive replay method to answer
> > queries about revisions younger than the latest pack file? (We may want
> > to use a smaller shard size in this case.)
> >
> > And, by the way, this idea isn't /so/ unwieldy. I think that you can
> > use *only* pack files --- and also use pack files for unfinished shards
> > --- if, when you run out of INDEX_SIZE, you rewrite the whole file and
> > move-into-place a new pack file (with a doubled INDEX_SIZE) on top of
> > the old pack file. The advantage here would be not having the
> > both-packed-and-nonpacked-shards headache to deal with.)
> Might be possible. Not sure if that's worth it.
> > > write $rev into 'cached-rev' file
> > > flush 'cached-rev' file
> >
> > No, you don't flush, you move-into-place.
> I decided to avoid move-into-place on purpose.
> This is my biggest question:
> Does FSFS presently move stuff into place if the move target location
> already exists? I am asking because I don't think this would work on
> Windows. We cannot overwrite a file if a reader has opened it.
> Or do we really use the delete-retry-loop on Windows servers?
> The only situation I can think of is where a previous commit created
> a revision file and didn't update 'current'. In which case it's unlikely
> that the revision file will be open when the next commit moves a file
> with the same name into place.

As I said on IRC: look into revprop editing in 1.6 and 1.7 FSFS.

(and probably the locks/ hierarchy, too; see fsfs/lock.c)

> > Okay, so the cache reading procedure MUST change, to account for the
> > non-packed cache file vanishing (by a concurrent packer) and the pack
> > file appearing instead. This is another layer of headache on top of the
> > base one :-(
> Do we support packing while the repository is live?


> I hadn't considered this.
> Thanks for your feedback!

Received on 2011-08-30 18:18:38 CEST

This is an archived mail posted to the Subversion Dev mailing list.
