
FSFS successor ID design draft

From: Stefan Sperling <stsp_at_elego.de>
Date: Fri, 26 Aug 2011 13:14:32 +0200

Below is an initial draft of a design for successor-IDs in FSFS.
It is a bit long, but I hope it covers all relevant points in
sufficient detail.

Please share your thoughts on this. I have most likely missed a few things.

If we like it I will put it into the repository for further editing,
either in notes/ on trunk or onto the fsfs-successor ID branch.

If this design has some severe problem I don't mind going back to
the drawing board. This is my first shot, after all :)

cmpilato, does this significantly differ from what you've done for BDB?
Will both backends have the same (or similar) behaviour if we use
this design for FSFS?

Thanks to neels and danielsh for all your input so far.
You've been tremendously helpful!

I would also be happy to see alternative proposals.

FSFS successor IDs
==================

See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
for what this is all about.

What follows are ideas about implementing successor ID support for FSFS.

Useful background information:
  Node-revisions: subversion/libsvn_fs_base/notes/fs-history
                  subversion/libsvn_fs_base/notes/structure
  fsfs design: subversion/libsvn_fs_fs/structure

The "cache" and its properties
==============================

Storage of successor IDs is essentially a cache of the inverted
predecessor relation between all node-revisions. Computing this
inversion is a *very* expensive operation (it means following every
node that ever existed in the repository from its most recent
revision back to the revision it was created in).

In the following, "cache" refers to the store for successor IDs.

After a general introduction, a proposed cache design is presented.
Note that the cache implementation will be hidden behind the API of
libsvn_fs_fs and can be improved or changed in every new minor
release of Subversion.

Required properties of successor ID cache for FSFS:
 - FSFS-compatible:
   - cache must use some plain file format
   - writers do not block readers (i.e. sqlite is out of the question)
   - cache must not rely on modifying existing revision files
 - cache can be regenerated on demand
 - do not corrupt the cache in case of an unexpected crash
 - try to keep number of directory entries low
 - conserve inodes by design (preferred), or via packing support

Nice-to-have properties:
 - the repository can be live during cache generation

Storage size requirements
=========================

Ballpark figures about node-revs from #svn-dev:
  * cmpilato does some math. There are over 350,000 node revisions in my
      ancient copy of the svn.collab.net Subversion repos.
  <hwright> there are over 20M of them in the ASF repo

How many successors does a given node-rev have?

  There is at most one successor on the same line of history (same copy_id).
  Each copy operation also creates one new successor.

How big will the cache become, in total?

  The successor cache does not store entire node-revs, just node-rev IDs.
  These are generally not longer than 40 bytes (less than half a line
  in an 80x25 terminal, as a rough estimate).

  The amount of raw information to store per node-revision is the
  size of this mapping:
    node_rev_id => successor_id

  Each entry is thus at most about 80 bytes: 40 bytes for the node_rev_id
  and 40 bytes for the successor ID.

  Each new node-revision adds one entry to the map.
  Since each successor is itself a node-revision, the entire cache
  has at most as many entries as there are node-revs in the
  repository.

  More precisely, the number of successors listed in the cache is
  equal to the number of node revisions in the repository, minus
  the number of nodes in HEAD (which don't presently have successors)
  and minus the number of nodes which were deleted in history (they
  don't have successors either).

  In other words, the number of successors in the cache is equal to
  the number of pred: entries in revision files.

  Let's try to plug in some numbers from the above svn.collab.net case.

  There are 350000 node revs.
    maximum size of successor IDs in cache = 40 bytes * 350000
    maximum size of node-rev IDs in cache = 40 bytes * 350000
    maximum size of cache = 40 bytes * 350000 * 2

  This amounts to roughly 28MB worst-case successor cache size for
  the svn.collab.net repository (which itself is about 500MB in size).
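The worst-case arithmetic above can be reproduced in a few lines. The
350,000 node-rev count and the 40-byte ID size are the estimates quoted
in this section, not measured values:

```python
# Worst-case cache size estimate, using the figures quoted above.
NODE_REVS = 350_000   # node-revs in the svn.collab.net repository (estimate)
ID_BYTES = 40         # upper estimate for one node-rev or successor ID

# One cache entry maps a node_rev_id to a successor_id, so two IDs per entry.
max_cache_bytes = NODE_REVS * ID_BYTES * 2
print(max_cache_bytes)  # 28000000, i.e. roughly 28MB
```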

Implementation proposal
=======================

- On-disk File Format -

All files related to the cache are stored in the
repos/db/successor-id-cache directory ("cache directory").

There is one file which contains the number of the revision the
cache was last synced to: repos/db/successor-id-cache/cached-rev
This is used by both readers and writers.

The directory also contains files named after revisions,
and is sharded modulo some value TBD:

 successor-id-cache/0/1
 successor-id-cache/0/2
 successor-id-cache/0/3
 ...
 successor-id-cache/1/N
 successor-id-cache/1/M
 ...

These individual files store successors of all node-revs of a given revision
(or, optionally, a number of revisions -- see notes on packing below).

Each file starts with an index, followed by zero or more successors-blocks
("s-block" for short).

The format of the index is shown below. It contains the index size
(which is a fixed value) and one entry for each node_rev that appears
in the revision(s) the file is responsible for.

The index size and offsets are 64 bit values.
(32 bit index size would probably be sufficient, but let's not count peanuts.)
node_rev_id is a variable length ASCII string terminated by '\0'.

 +----------------------------------------------------------------------+
 | index size |
 +----------------------------------------------------------------------+
 | node_rev_id | offset to my first s-block | offset to my last s-block |
 | node_rev_id | offset to my first s-block | offset to my last s-block |
 | node_rev_id | offset to my first s-block | offset to my last s-block |
 | ... |
 +----------------------------------------------------------------------+
 | zero padding to multiple of s-block size |
 +----------------------------------------------------------------------+

If a node-rev does not have any successors, both s-block offsets are zero.
This makes finding the answer to the question "does this node have any
successors?" an O(1) operation (only one disk block needs to be read).

If none of the node-revs have successors yet, the file ends here.

If a node-rev has successors, the file contains one or more s-blocks.
An s-block lists successors of the node-rev and provides some amount of
pre-allocated space for future successors. The offset to the first s-block
helps readers, the offset to the last s-block helps writers (the next
section provides details).

The padding at the end of the index ensures that all s-block boundaries
lie on disk sector boundaries (more details later).

The format of an s-block is shown below. It has a fixed size (SBLOCK_SIZE).
The offsets are 64 bit values. successor_id is a variable length ASCII
string terminated by '\0'.

 +----------------------------------------------------------------------+
 | offset of next s-block |
 +----------------------------------------------------------------------+
 | offset of free space in this s-block |
 +----------------------------------------------------------------------+
 | successor_id, successor_id, ... |
 +----------------------------------------------------------------------+
 | free space zero-padded to SBLOCK_SIZE |
 +----------------------------------------------------------------------+

What is a good value for SBLOCK_SIZE?
Modern disks have a block size of 8192 bytes (some disks even use this
block size internally but report 512 bytes of block size to the OS).
Using this block size means that each s-block will fit into one sector
on modern disks, and into 16 sectors on older disks using 512 byte sectors.
In both cases an s-block can be read in a single sequential read.

Recall that a successor_id will take about 40 bytes. Let's round that
up to 64 for good measure. Applying advanced mathematics tells us that
8192 bytes will provide enough space for at least 128 successors of a
node-rev-id. This means we'll have to allocate a new s-block about every
128 successors (usually a slightly higher number of successors will fit).
If more successors must fit, SBLOCK_SIZE should be some multiple of 8192.
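The capacity estimate is easy to check. Accounting for the two offset
fields at the start of each s-block (which the rounded estimate above
ignores) shaves one entry off the padded figure:

```python
# Capacity of one s-block under the estimates used above.
SBLOCK_SIZE = 8192
HEADER_BYTES = 8 + 8  # 'next s-block' offset + free-space offset, 64 bits each

per_block_padded = (SBLOCK_SIZE - HEADER_BYTES) // 64    # padded 64-byte estimate
per_block_realistic = (SBLOCK_SIZE - HEADER_BYTES) // 40  # realistic 40-byte estimate
print(per_block_padded, per_block_realistic)  # 127 204
```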

If this s-block is the last s-block for this node-rev, the offset of
the next s-block is zero.

The offset of free space within the s-block is used by readers when
reading existing successors and by writers when appending new successors.
(Note that in the implementation the offset of free space might also
be a 16 bit number relative to the start of the s-block, instead of
being a 64 bit number. But let's ignore such optimisations for now.)

- Procedure for Readers -

A reader first reads the 'cached-rev' file and remembers the revision number.

It then locates the cache file corresponding to the given node_rev_id.
The name of the file is keyed on the revision of the node-rev. (Without
packing the file has the same name as the revision. With packing the
file is named after this revision modulo some fixed and known value.)

It opens this file and reads the index to learn the offset of the
first s-block of the node_rev_id.

If the offset is zero, the node-rev has no successors, and the search ends.

The reader seeks to the offset of the first s-block.

It reads the offset of free space within the s-block, and then reads
successors from the s-block up to this offset.

If the offset to the next s-block is zero, the search ends.
If the offset to the next s-block is non-zero, the reader
proceeds to the next s-block.

When the search ends, the reader returns the list of successors
and the number of the cached_rev, which indicates up to which
revision the successor list is valid. (This is a lower bound; the
successor list might already contain some values for newer revisions
if a writer is concurrently updating the cache. However, there is no
harm done if the reader uses this information because it pertains to
already committed revisions.)

Note that there are only O(number of s-blocks) disk seeks involved.
Any time data is read from disk one or more disk blocks are read sequentially.

- Procedure for Writers -

Note that the following describes how successor data for a single
node_rev_id is updated. This is only a sub-step of the global cache
synchronisation procedure (which is described in the next section).

A writer operates under the assumption that the entire cache directory
has been locked for writing.

The writer locates the cache file corresponding to the given node_rev_id.

The writer opens this file and reads the index to learn the offset of the
last s-block of the node_rev_id.

The writer seeks to the offset of the last s-block.

The writer reads the offset of remaining free space within the s-block,
and calculates the amount of free space left within the s-block (recall
that the s-block has fixed size SBLOCK_SIZE).

The writer writes successor IDs to free space within the s-block,
and flushes the cache file to disk.

The writer updates the offset of free space in the s-block, and
flushes the cache file to disk.

(At this point readers get to see the new successor IDs.)

At any point, if remaining free space is too small for all new successors,
the writer must allocate a new s-block.

To do so, it seeks to the very end of the file and appends a new s-block
which initially contains only zeros.

The writer writes successors to free space in the new s-block.
It updates the offset to free space in the new s-block, and flushes
the cache file to disk.

It updates the 'next s-block' offset in the previous s-block to the
offset of the new s-block, and flushes the cache file to disk.

If the writer crashes, the following cases can happen:

  - New successors have been added to an s-block, but the offset to
    free space has not been updated.
    => Newly added successors will be overwritten by the next writer,
       and readers don't see them.
  - A new s-block has been written, but the 'next s-block' offset at
    the previous s-block has not been updated.
    => An unused s-block will remain in the file.
       The next writer adding successors of the same node-rev will
       append a new s-block to the file. Readers don't see the unused s-block.
       If the cache is rebuilt from scratch the unused s-block disappears.

Thus, cache consistency is guaranteed even in case of crashes.

- Cache Synchronisation -

The procedure for synchronising the cache is as follows.
Note that it assumes a write lock on the cache is held.

This procedure is run during post-commit processing after 'current'
has been updated. It can also be run manually via svnadmin.

sync_successor_id_cache() {
  open 'cached-rev' file
  read $cached_rev from 'cached-rev' file

  open 'current' file
  read $new_cached_rev from 'current' file
  close 'current' file

  # The new_successors hash table is keyed by a node_rev_id.
  # Its values are lists of successor IDs.
  #
  # The on-disk cache file format is designed for batch updates.
  # It is inefficient to add new successors to cache files one by one.
  # To update cache files efficiently, this table will grow up to a certain
  # upper bound and be flushed to disk periodically.
  new_successors = {}

  # revisions up to and including $cached_rev are already fully cached
  for rev in [$cached_rev + 1, ..., $new_cached_rev] {

    for node_rev in $rev {
      obtain predecessor p_node_rev of $node_rev
      obtain revision p_rev from $p_node_rev
      if cache file $p_rev exists {

        # this uses the 'reader' procedure as explained above
        look up $p_node_rev:$node_rev in cache file

        if $p_node_rev:$node_rev is already listed
          continue with next node_rev
        else
          new_successors[$p_node_rev] += $node_rev
      } else {
        new_successors[$p_node_rev] += $node_rev
      }
    }

    if space required by new_successors reached upper bound {
      for node_rev in new_successors.keys() {
        # update_cache_file() implements the 'writer' procedure explained above
        update_cache_file($node_rev, new_successors[$node_rev])
      }
      write $rev into 'cached-rev' file
      flush 'cached-rev' file
      new_successors = {}
    }
  }

  if new_successors != {} {
    for node_rev in new_successors.keys() {
      # update_cache_file() implements the 'writer' procedure explained above
      update_cache_file($node_rev, new_successors[$node_rev])
    }
    write $new_cached_rev into 'cached-rev' file
  }

  close 'cached-rev' file
}

If this code crashes, the following can happen:
  - The cache contains successors computed for revisions beyond the
    revision listed in the 'cached-rev' file.
    => The next writer will not add these successors again. No harm done.

- Packing -

The cache can optionally be packed. This means that successors of
node_revs from more than one revision live in the same cache file.

The on-disk format and the cache synchronisation procedure do not change.
The only thing that changes is the procedure to look up the name of the
cache file responsible for a given node-rev. The cache file is named
after the node-rev's revision modulo some fixed and known value.

To create the index of a packed cache file the combined number of
node-revs in all revisions that share the same cache file must be known.
Thus, a packed cache file can only be created after all revisions
which share the cache file have been committed.

Packing can be performed with 'svnadmin pack'.
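One plausible reading of the file-naming scheme, with illustrative
shard and pack sizes (both values, and the exact path layout, are TBD
in the design; this is not the settled naming rule):

```python
SHARD_SIZE = 1000   # revisions per shard directory (illustrative)
PACK_FACTOR = 16    # revisions per packed cache file (illustrative; 1 = unpacked)

def cache_file_path(rev):
    # A packed cache file is named after the first revision it covers,
    # i.e. the node-rev's revision rounded down to a multiple of PACK_FACTOR.
    name = (rev // PACK_FACTOR) * PACK_FACTOR
    return 'db/successor-id-cache/%d/%d' % (rev // SHARD_SIZE, name)

print(cache_file_path(1234))  # db/successor-id-cache/1/1232
```

With PACK_FACTOR set to 1 this degenerates to the unpacked case, where
each cache file is named after its revision.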

- How does this design live up to our requirements? -

 - FSFS-compatible:
   - cache must use some plain file format
       => yes
   - writers do not block readers
       => yes, a reader will get consistent data even while the cache
          is being written to
   - cache must not rely on modifying existing revision files
       => yes
 - cache can be regenerated on demand
     => yes
 - do not corrupt the cache in case of an unexpected crash
     => the cache can only contain more information than claimed by
        the 'cached-rev' file
 - try to keep number of directory entries low
     => yes, via sharding
 - conserve inodes by design (preferred), or via packing support
     => yes, via optional packing

 - the repository can be live during cache generation
     => yes, but readers will not see all successors until the cache
        is fully populated
Received on 2011-08-26 13:15:22 CEST

This is an archived mail posted to the Subversion Dev mailing list.