
Repository GUID design

From: Greg Hudson <ghudson_at_MIT.EDU>
Date: 2002-12-16 06:21:31 CET

I'll take a stab at writing down what a repository guid is all about.


A guid is a byte string. The only thing you can do with its contents
is compare them with another guid's contents for equality. Ideally,
new guids are made up in such a way that the same guid is never
invented twice.
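
To make the "opaque byte string" idea concrete, here is a rough sketch in Python. The `Guid` class and `new_guid` helper are hypothetical names, and drawing the bytes from a random UUID is my assumption for illustration; nothing above requires that particular scheme.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Guid:
    """Opaque repository identifier; equality is the only meaningful operation."""
    raw: bytes


def new_guid() -> Guid:
    # Assumption: 16 random bytes make an accidental repeat vanishingly unlikely.
    return Guid(uuid.uuid4().bytes)


a = new_guid()
b = new_guid()
same_as_a = Guid(a.raw)  # e.g. the same guid read back from storage
```

Callers only ever ask whether two guids are equal; they never parse or order the bytes.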

From the client's perspective, a <guid, revnum> pair uniquely
identifies a revision. If the guid is the same and the revnum is the
same, the contents should be the same.

From the filesystem's perspective, a <guid, node-revision-id> pair
uniquely identifies a node-revision. If the guid is the same and the
node-revision-id is the same, the contents should be the same.
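
One way to picture these two identities is as tuple keys. This is only a sketch: the type names, the sample guid bytes, and the cache are all made up for illustration.

```python
from typing import NamedTuple


class RevisionId(NamedTuple):
    guid: bytes   # opaque repository guid
    revnum: int


class NodeRevisionId(NamedTuple):
    guid: bytes        # opaque repository guid
    node_rev_id: str   # filesystem node-revision-id, treated as opaque here


# A cache keyed by identity: equal keys are assumed to name equal contents.
base_text_cache = {}
base_text_cache[RevisionId(b"repo-A", 7)] = "contents of r7 in repository A"

same_repo_same_rev = RevisionId(b"repo-A", 7) in base_text_cache
other_repo_same_rev = RevisionId(b"repo-B", 7) in base_text_cache
```

The revnum alone is not enough to find the cached entry; the guid is what ties it to one repository.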

Two repositories should have the same guid if their contents are
expected to always be the same. Otherwise, they should have different
guids.

The guid provides no information about how to retrieve repository
contents; it merely identifies it.


  * Working copies could use the guid to verify that their base
    contents are valid.
  * "svn switch --relocate" can use the guid to verify that the
    repository being switched to is identical to the repository being
    switched from.
  * Copy history could be passed as <guid, path, rev> instead of <url,
    rev>, although that would make more sense for ra_svn than it does
    for ra_dav.  If we did pass copy history this way, our checks to
    ensure that copy history comes from the same repository could be
    made more
  * Repeated merges: Ancestry information can be kept using <guid,
    rev, path> tuples and ranges.  This could be used to avoid
    applying deltas repeatedly in a merge operation, even in a future
    world where we allow merges between URLs in different
    repositories.
  * Read-only mirroring: The guid could be added to the
    node-revision-id (either directly or through a proxy key), and
    servers could be allowed to pull content from other repositories.
  * Distributed content: Given the previous two extensions, the FS
    could allow copy history to come from other repositories.  This
    would have wide-ranging impact on many operations.
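
As an illustration of the "svn switch --relocate" use, here is a minimal sketch of the check in Python. It is not how the client is implemented: the working copy is modelled as a plain dict, `fetch_guid` stands in for whatever ra-layer call would return the guid, and the URLs are invented.

```python
class RelocateError(Exception):
    pass


def relocate(working_copy: dict, new_url: str, fetch_guid) -> None:
    """Point the working copy at new_url only if the guid there matches."""
    remote_guid = fetch_guid(new_url)
    if remote_guid != working_copy["guid"]:
        raise RelocateError("repository at %s has a different guid" % new_url)
    working_copy["url"] = new_url


# Hypothetical servers: the mirror shares the guid, the other one does not.
guids_by_url = {
    "http://old.example/repo": b"guid-1",
    "http://mirror.example/repo": b"guid-1",
    "http://other.example/repo": b"guid-2",
}
wc = {"url": "http://old.example/repo", "guid": b"guid-1"}
relocate(wc, "http://mirror.example/repo", guids_by_url.get)
```

Relocating the same working copy to `http://other.example/repo` would raise `RelocateError` and leave the URL unchanged.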

Here are guid-related circumstances which might cause code to fail.

1. Two different repositories have the same guid.

This could happen if two guids are accidentally invented the same or
if one guid is corrupted so as to be equal to another, but the most
likely circumstances are:
  * A repository is copied, using a method which preserves the guid,
    and the copy is allowed to diverge.
  * Two repositories are being kept in sync through out-of-band means
    which preserve the guid, but the synchronization is imperfect.

(A repository could also be recovered from backup, preserving the
guid, and the more recent revisions which were lost by restoring from
backup could be replaced by different contents, invalidating a working
copy's base-text.  But that's not a guid-related failure mode.)

The following things could go wrong in this failure mode:
  * An "svn switch --relocate" could succeed when it ought to fail;
    the working copy's base-text and base-props are not valid for the
    new repository's contents.
  * Copy history could be treated as valid (from the same repository)
    even though it is invalid.
  * Deltas might not be applied during a merge even though they should
    be, because the merge target appears to already have deltas which
    it does not have.
  * A repository might refuse to pull content from another repository
    because the two repositories have the same guid.
  * A repository might misfile content pulled from another repository
    because the source repository has the same guid as a different
    repository from which content was previously pulled.
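
The merge item in this list can be sketched as follows. This assumes ancestry is kept as a set of <guid, rev, path> tuples; the function name and sample values are hypothetical.

```python
def revisions_to_apply(ancestry: set, guid: bytes, path: str, revs: range) -> list:
    """Return the revisions in revs whose deltas the target does not already record."""
    return [r for r in revs if (guid, r, path) not in ancestry]


# The target already merged r5-r7 of /trunk from the repository with guid b"G".
target_ancestry = {(b"G", r, "/trunk") for r in range(5, 8)}

# Normal repeated merge of r5-r9 from that repository: only r8 and r9 remain.
from_same_repo = revisions_to_apply(target_ancestry, b"G", "/trunk", range(5, 10))

# Failure mode: a diverged copy still advertises guid b"G", so the call is
# indistinguishable and its (different) r5-r7 are skipped as already merged.
from_diverged_copy = revisions_to_apply(target_ancestry, b"G", "/trunk", range(5, 10))

# Had the diverged copy been given a newly invented guid, nothing is skipped.
from_reguided_copy = revisions_to_apply(target_ancestry, b"H", "/trunk", range(5, 10))
```

The code has no way to tell the first two calls apart, which is exactly why the divergent copy needs its own guid.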

2. Two identical repositories have different guids.

This could happen if:
  * A guid is corrupted or is accidentally changed.
  * Two repositories are kept in sync through out-of-band means which
    do not preserve the guid.
  * A repository is moved or restored from backup or converted from an
    old db schema through a means which does not preserve the guid.

The following things could go wrong in this failure mode:
  * A working copy could erroneously refuse to operate, thinking its
    base contents are invalid.
  * An "svn switch --relocate" could fail when it ought to succeed.
  * Copy history could be treated as invalid even though it is valid.
  * Deltas might be repeatedly applied during a merge when they
    shouldn't be.
  * A repository might misfile or lose contents pulled from another
    repository.

1. How do we store the guid?

We could store it in a rev-0 property and access it through the
rev-prop ra layer commands; or we could store it elsewhere (in a BDB
table or just a file) and invent new ways to access it.
  * Properties are a convenient end-to-end path between the FS and the
    client code.  Using them means the interior layers (libsvn_client,
    libsvn_ra_*, libsvn_repos) don't have to become more complicated,
    but the exterior layers (libsvn_wc, libsvn_fs_*) grow in a
    somewhat less natural way.
    A good way of exploring this issue is to imagine that Unix inode
    fields were implemented using properties instead of fixed fields.
    The system call interface wouldn't have had to change when new
    information was added (like the "immutable" and "compressed"
    attributes in ext2fs), but the kernel fs code for getting and
    setting an inode property would be kind of gross, as would the
    exception-handling discipline in userland.  (Did I get an EPERM
    because I can't set any properties on this inode, or because I
    can't set that particular property to that particular value?)
  * If rev-props acquire checksums, then we'll get a checksum on the
    guid for free if we use a rev-prop; that checksum would make it
    easier to diagnose a corrupted guid.  (But I'm not convinced
    anyone will ever experience a corrupted guid.)
  * We should try to be architecturally consistent.  If we don't use
    rev-props for the guid because we don't want to complicate the
    ends of the properties pipe, then did we make the right choice to
    use rev-props for commit information?
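
To compare the two storage options side by side, here is a rough Python sketch. Everything in it is hypothetical: the property name `svn:guid`, the class names, and the accessor names are placeholders, not existing Subversion interfaces.

```python
GUID_PROP = "svn:guid"  # hypothetical property name, not defined by this proposal


class RevPropStore:
    """Option A: the guid is an ordinary property on revision 0."""

    def __init__(self, guid: bytes):
        self.revprops = {0: {GUID_PROP: guid}}

    def get_revprop(self, rev: int, name: str):
        return self.revprops.get(rev, {}).get(name)


class DedicatedStore:
    """Option B: the guid lives outside the property system and needs its own accessor."""

    def __init__(self, guid: bytes):
        self._guid = guid
        self.revprops = {0: {}}

    def get_guid(self) -> bytes:
        return self._guid


def client_guid_via_props(repo) -> bytes:
    # Reuses the existing generic call; the middle layers need no new interface.
    return repo.get_revprop(0, GUID_PROP)


a = RevPropStore(b"guid-1")
b = DedicatedStore(b"guid-1")
```

With option A the client reaches the guid through a call it already has; with option B every layer between the FS and the client has to learn about `get_guid` or its equivalent.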

2. How hard do we make it to change the guid?

A guid needs to be changed to a known value if a repository was copied
in a manner not preserving the guid, and the new repository is
expected to be kept in sync with the old one through out-of-band
means.  The same applies if the repository was restored from backup or
converted from an old db schema in a manner not preserving the guid.

A guid needs to be changed to a newly-invented value if a repository
was copied in a manner preserving the guid, and the new repository is
expected to diverge from the old one.

Bad things will happen if someone naively changes the guid, but this
may be an unlikely user error, especially if the template rev-prop
hook script disallows it.
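
As a sketch of what such a hook could decide, assuming the guid were stored as a rev-0 property under the hypothetical name `svn:guid`: the function below is only the decision logic, and wiring it into an actual hook script is not shown.

```python
GUID_PROP = "svn:guid"  # hypothetical property name


def allow_revprop_change(rev: int, propname: str) -> bool:
    """Reject any attempt to modify the guid property on revision 0."""
    if rev == 0 and propname == GUID_PROP:
        return False
    return True
```

An administrator who really does need to change the guid would have to edit or bypass the hook deliberately, which makes the naive mistake harder.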

If we make it hard to copy the repository without preserving the guid
(as would be the case if it's a rev-0 property), then people should
only need the "change to newly-invented value" operation under normal
circumstances.  Of course, if they do that by accident, suddenly they
want the "oops, change it back, please" operation.
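
The three operations mentioned here (change to a known value, change to a newly-invented value, and change it back) could look roughly like this. The `GuidHolder` class is a made-up stand-in for wherever the guid ends up being stored, and remembering one previous value is my assumption.

```python
import uuid


class GuidHolder:
    def __init__(self, guid: bytes):
        self.guid = guid
        self.previous = None  # guid in effect before the most recent change

    def set_known(self, guid: bytes) -> None:
        """Adopt the guid of a repository this one is kept in sync with."""
        self.previous, self.guid = self.guid, guid

    def reinvent(self) -> None:
        """Adopt a newly invented guid because this copy will diverge."""
        self.set_known(uuid.uuid4().bytes)

    def undo(self) -> None:
        """The 'oops, change it back, please' operation."""
        if self.previous is None:
            raise ValueError("no earlier guid recorded")
        self.guid, self.previous = self.previous, None


repo = GuidHolder(b"original")
repo.reinvent()
changed = repo.guid
repo.undo()
```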

3. Should dump/load preserve the guid?

Unfortunately, we don't know whether a load is being done to restore
the repository from backup, or to convert it from an old db schema, or
to keep it in sync with another repository, or to create a new and
divergent repository with similar contents to the old one.

But I think the default should be to preserve the guid.  Creating a
divergent repository with dump/load doesn't seem like a very common
operation.
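
A minimal sketch of that default, with the dump modelled as a plain dict and a hypothetical `preserve_guid` parameter (this is not the dump format or any existing load interface):

```python
import uuid


def load(dump: dict, preserve_guid: bool = True) -> dict:
    """Build a new repository from a dump, keeping the dumped guid by default."""
    guid = dump["guid"] if preserve_guid else uuid.uuid4().bytes
    return {"guid": guid, "revisions": list(dump["revisions"])}


dump = {"guid": b"guid-1", "revisions": ["r0", "r1", "r2"]}
restored = load(dump)                        # backup restore or schema conversion
divergent = load(dump, preserve_guid=False)  # deliberately divergent copy
```

The person doing the load is the only one who knows the intent, so the less common case is the one that has to be asked for explicitly.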
Received on Mon Dec 16 06:22:17 2002

This is an archived mail posted to the Subversion Dev mailing list.