
Re: FSFS successor ID design draft

From: Daniel Shahaf <danielsh_at_elego.de>
Date: Mon, 5 Sep 2011 13:23:14 +0300

Stefan Sperling wrote on Mon, Sep 05, 2011 at 11:38:11 +0200:
> On Sun, Sep 04, 2011 at 02:23:28AM +0300, Daniel Shahaf wrote:
> > > Each node-revision map file stores a mapping of the form
> > > node_rev_id => [revnum, revnum, ...]
> > > The revision numbers identify those revisions in which one or more
> > > successors of a given node-revision were added.
> > >
> > > In the first iteration of this design, the mapping is stored as lines
> > > of ASCII text. Each line contains an unparsed node_revision_id, a space
> > > separator (ASCII 0x20), and a revision number in ASCII decimal representation.
> > > (The design may be refined later to use a more efficient storage model.)
> > >
> >
> > Some wild ideas here:
> >
> > - Multiple, concatenated DELTA\n..ENDREP\n svndiff streams. That's
> > appendable and compressible (a deltify process can combine adjacent
> > streams, which will trigger zlib compression).
> >
> > (That won't work for the successors data since offsets into it are
> > stored elsewhere.)
>
> So you're saying that we should run the plaintext proposed above
> through svndiff? Can you explain in more detail how this would work?
> What is the base of a delta?
>

The file contains one or more DELTA\n..ENDREP\n streams:

  DELTA
  <svndiff stream>
  ENDREP
  DELTA
  <svndiff stream>
  ENDREP

(On second thought, we should be storing the length of the stream
somewhere; the DELTA header seems a fine place:

  DELTA 512
  <512 bytes of svndiff stream>
  ENDREP
  DELTA 37
  <37 bytes of svndiff stream>
  ENDREP

.) When the file is read, readers decode all the deltas and concatenate
the resulting plaintexts. When the file is rewritten, writers
optionally combine the first N deltas into a single delta that produces
the combined plaintext.

The deltas can be self-compressed (like a DELTA\n rep in the revision
files), i.e., having no base.
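To make the container format concrete, here is a rough Python sketch of
a reader (illustrative only; the svndiff expansion itself is elided
behind a caller-supplied 'decode'):

```python
import io

def read_delta_streams(f, decode):
    """Read concatenated 'DELTA <len>\\n<payload>ENDREP\\n' records from F
    and return the concatenation of the decoded plaintexts.  DECODE is a
    stand-in for the svndiff expansion, which is not shown here."""
    plaintext = b''
    while True:
        header = f.readline()
        if not header:
            break  # end of file
        if not header.startswith(b'DELTA '):
            raise ValueError('malformed DELTA header: %r' % header)
        length = int(header.split()[1])
        payload = f.read(length)
        if len(payload) != length:
            raise ValueError('truncated svndiff stream')
        if f.readline() != b'ENDREP\n':
            raise ValueError('missing ENDREP trailer')
        # Self-compressed deltas have no base, so each payload expands
        # independently; the results are simply concatenated.
        plaintext += decode(payload)
    return plaintext
```

Since each record is base-less, a rewriter can replace the first N
records with one combined record without touching the later ones.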

> > - Some packing support that also implements grouping of all the map
> > entries with the same lhs? Is that needed, or should the API at this
>
> What is 'lhs'?
>

lhs = left-hand side
rhs = right-hand side

> > level only implement 'slurp the entire map into a hash mapping ints to
> > arrays of revnums'?
>
> I would say so, with the possible restriction that only a certain range
> of revisions is slurped into memory. This way, the API can optimise for
> callers which are only interested in a certain revision range (e.g.
> graphing tools can reconstruct the history of a file in several steps).
>
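For illustration, a range-restricted slurp over the proposed ASCII
format could be as simple as (hypothetical names):

```python
def slurp_successors(lines, min_rev, max_rev):
    """Parse the ASCII noderev map ('node_rev_id SP revnum' per line)
    into a dict mapping node-rev IDs to lists of revnums, keeping only
    revisions in [min_rev, max_rev].  Names are illustrative only."""
    successors = {}
    for line in lines:
        node_rev_id, revnum = line.split(' ')
        revnum = int(revnum)
        if min_rev <= revnum <= max_rev:
            successors.setdefault(node_rev_id, []).append(revnum)
    return successors
```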
> > (OT: those terms for the three kinds of files you have haven't sat well
> > with me; from the term alone I couldn't instantly recall what the
> > contents of the file were.)
>
> What would you call them? I'm open to suggestions.
>

How about naming them after the RHS'es of the mappings rather than
after the fact that they are mappings?

Currently:

- noderev map file, revision map file, successors data file

Perhaps:

- noderev posterity file, successor offsets file, successors data file

(Is 'progeny' the more appropriate word here? I'm not sufficiently
familiar with its use, and there are no mentions of it as a simple noun
in the dev@ archives...)

> > > In exceptional cases (see below) it is possible that a given revision
> > > does not actually contain any successors of the given node_rev_id.
> > >
> >
> > Implementation question: would it make sense to have a mode for
> > inquiring about several noderev's successors at the same time, in order
> > to amortize the lookup operation (three open()s, several read()s of the
> > successor data file)'s cost across them?
>
> Yes, sure.
>
> The API could take a list of node-revision IDs and return a hash table
> keyed on node-revision ID, with the values being arrays of revision offsets.
>

+1
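Something along these lines (hypothetical names; the real code would
reuse the already-open file handles rather than re-opening per noderev):

```python
def get_successors_batch(map_lines, node_rev_ids):
    """Answer every query in one pass over the (already-open) noderev
    map, amortizing the open()/read() cost that a per-noderev lookup
    would repeat.  Returns a dict keyed on node-rev ID, as proposed."""
    wanted = set(node_rev_ids)
    result = {nid: [] for nid in wanted}
    for line in map_lines:
        node_rev_id, revnum = line.split(' ')
        if node_rev_id in wanted:
            result[node_rev_id].append(int(revnum))
    return result
```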

> > > When committing a finished transaction (in fs_fs.c:commit_body()),
> > > the writer obtains a write-lock on all files within the successors
> > > directory it needs to modify.
> > >
> >
> > Clarification: do you propose that a single write lock be obtained to
> > cover the entire successors/** hierarchy?
> >
> > Also: is the FSFS write lock (which is held by commit_body() whilst it
> > runs) sufficient for this?
>
> I would say the global write lock is sufficient.
>
> I wasn't sure how exactly locking in FSFS works. You can read this
> statement as "the writer has some kind of write lock on everything it
> needs to change, possibly implied by a global write lock".
>

Okay.

> > > If rN falls within the revision range of an existing revision map file,
> > > it copies ((N-2)*4) bytes of content from the revision map file into the
> > > temporary file. (Note that this step discards data left behind by any
> > > previous attempt at committing rN).
> > >
> >
> > ((N-2)*4) is bogus for N=1, and it's also missing a modulo operation
> > somewhere.
>
> Math is hard!
>
> > But the gist is clear: copy the four bytes of every
> > preceding revision in the shard.
>
> Yup.
>

:-)
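For the archives, the corrected arithmetic spelled out (assuming the
usual fixed-size FSFS sharding; whether r0 occupies a slot depends on
the final layout):

```python
def bytes_to_copy(n, shard_size):
    """Bytes of an existing revision map file to copy when committing
    rN: four bytes for every preceding revision in rN's shard."""
    first_rev_in_shard = (n // shard_size) * shard_size
    return (n - first_rev_in_shard) * 4
```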

> > > If the writer fails to update all files in the successors directory,
> > > it will also fail to update 'current'. In this case, readers will ignore
> > > the new successor data until another commit succeeds in creating rN.
> > > Once a commit of rN has succeeded, readers following node-rev-id map
> > > entries updated by the failing commit might end up with an empty
> > > successors list for rN. Such erroneous entries will not be cleaned
> > > up automatically, but can be eliminated by re-creating the repository
> > > (e.g. via a dump/load cycle).
> >
> > Near the start you referred to a 'compromise'. Is that the compromise?
>
> This is what I meant, yes.
>

Okay.

> > That said, one issue. Debris in the successor data file or revision map
> > file will be overwritten as soon as an attempt to commit rN succeeds;
> > however, noderev map entries remain until a dump/load. How feasible
> > would it be to also have a cleanup procedure for them?
>
> I am happy to just leave this debris in the files for now.
>
> I would guess that nobody will ever even notice this problem in practice.
> The number of commits failing within the time window where successor data
> is updated will statistically be very low to begin with.
> Each time it happens we lose a very small fraction of disk space. We also
> suffer a teeny tiny bit of read performance loss for readers of successors
> of the affected node-revision. So what...
>
> If it ever becomes a real problem, people can dump/load.
>

It'll work, but it's a kill-a-fly-with-a-fleet approach :). Dump/load
the entire history (many MB of svndiffs) only to fix some derived
noderev->offset map data?

> One thing that is a little annoying is that file checksums for successor
> maps in production and backup repositories might not match up. Folks
> might then infer that there is some kind of repository corruption.
> So we should make 'svnadmin verify' print a warning if we find debris
> entries in the map. The warning would also explain that this is harmless
> and does not constitute corruption.
>

Sounds fine. (And another job for the svn_fs__verify() API!)

> > Such a cleanup procedure could grab the commits (and successors, if it's
> > separate) write locks, attempt to read the successors data for all
> > noderevs in a noderev map file, and delete the bogus entries from the
> > noderev map file. (It could delete them either by atomically replacing
> > the file with a new version, or by replacing the first byte of the
> > revnum by some agreed non-[0-9\n] byte representing "This line of the
> > noderev map file shall be ignored".
> >
> > (it must be a single byte because some revnums are represented in ASCII
> > by a single byte)
>
> If you really want to we can always bolt something like this on later.

Agreed.
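For the record, the in-place variant could look something like this
(illustrative; '_' stands for the agreed non-[0-9\n] tombstone byte):

```python
import io

TOMBSTONE = b'_'  # any agreed byte outside [0-9\n] would do

def tombstone_entry(f, line_offset, line):
    """Mark one noderev-map line as ignored by overwriting the first
    byte of its revnum in place.  A single-byte write is used because
    some revnums (e.g. '7') are themselves only one byte long."""
    node_rev_id, _revnum = line.split(b' ')
    # The revnum starts right after the node-rev-id and the space.
    revnum_offset = line_offset + len(node_rev_id) + 1
    f.seek(revnum_offset)
    f.write(TOMBSTONE)
```

Readers would then skip any line whose revnum field starts with the
tombstone byte.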

(Did I just propose another instance of "Let's bolt <another> thing on FSFS!"?)
Received on 2011-09-05 12:33:05 CEST

This is an archived mail posted to the Subversion Dev mailing list.