Repository GUID design
From: Greg Hudson <ghudson_at_MIT.EDU>
Date: 2002-12-16 06:21:31 CET
I'll take a stab at writing down what a repository guid is all about.
PROPERTIES OF A GUID
A guid is a byte string. The only thing you can do with its contents
From the client's perspective, a <guid, revnum> pair uniquely
From the filesystem's perspective, a <guid, node-revision-id> pair
Two repositories should have the same guid if their contents are
The guid provides no information about how to retrieve repository
----

Working copies could use the guid to verify that their base contents
are valid.

"svn switch --relocate" can use the guid to verify that the repository
being switched to is identical to the repository being switched from.

Copy history could be passed as <guid, path, rev> instead of
<url, rev>, although that would make more sense for ra_svn than it
does for ra_dav. If we did pass copy history this way, our checks to
ensure that copy history comes from the same repository could be made
more robust.

Repeated merges: Ancestry information can be kept using
<guid, rev, path> tuples and ranges. This could be used to avoid
applying deltas repeatedly in a merge operation, even in a future
world where we allow merges between URLs in different repositories.

Read-only mirroring: The guid could be added to the node-revision-id
(either directly or through a proxy key), and servers could be allowed
to pull content from other repositories.

Distributed content: Given the previous two extensions, the FS could
allow copy history to come from other repositories. This would have
wide-ranging impact on many operations.

FAILURE MODES
-------------

Here are guid-related circumstances which might cause code to fail.

1. Two different repositories have the same guid.

This could happen if two guids are accidentally invented the same or
if one guid is corrupted so as to be equal to another, but the most
likely circumstances are:

  * A repository is copied, using a method which preserves the guid,
    and the copy is allowed to diverge.

  * Two repositories are being kept in sync through out-of-band means
    which preserve the guid, but the synchronization is imperfect.

(A repository could also be recovered from backup, preserving the
guid, and the more recent revisions which were lost by restoring from
backup could be replaced by different contents, invalidating a working
copy's base-text. But that's not a guid-related failure mode.)
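To make the first failure mode concrete, here is a minimal sketch of a relocate-style check that trusts the guid alone. Everything in it is hypothetical: `Repo`, `same_repository` and `can_relocate` are illustrative names, not Subversion APIs, and the sketch assumes that guids are only ever compared for equality.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy stand-in for a repository: an opaque guid plus per-revision contents."""
    guid: bytes
    revisions: list = field(default_factory=list)  # revisions[n] = contents of rev n


def same_repository(a: Repo, b: Repo) -> bool:
    # This sketch treats the guid as opaque and only compares it for equality.
    return a.guid == b.guid


def can_relocate(old: Repo, new: Repo) -> bool:
    # Hypothetical relocate check: it trusts the guid and nothing else.
    return same_repository(old, new)


original = Repo(guid=b"\x01\x02\x03", revisions=["r0", "r1-a"])
# A copy made by a guid-preserving method, then allowed to diverge.
diverged = Repo(guid=original.guid, revisions=["r0", "r1-b"])

print(can_relocate(original, diverged))          # the guid check passes
print(original.revisions == diverged.revisions)  # yet the contents differ
```

The check passes even though revision 1 differs between the two repositories, which is the situation where a working copy's base-text would no longer be valid.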
The following things could go wrong in this failure mode:

  * An "svn switch --relocate" could succeed when it ought to fail;
    the working copy's base-text and base-props are not valid for the
    new repository's contents.

  * Copy history could be treated as valid (from the same repository)
    even though it is invalid.

  * Deltas might not be applied during a merge even though they should
    be, because the merge target appears to already have deltas which
    it does not have.

  * A repository might refuse to pull content from another repository
    because the two repositories have the same guid.

  * A repository might misfile content pulled from another repository
    because the source repository has the same guid as a different
    repository which content was previously pulled from.

2. Two identical repositories have different guids.

This could happen if:

  * A guid is corrupted or is accidentally changed.

  * Two repositories are kept in sync through out-of-band means which
    do not preserve the guid.

  * A repository is moved or restored from backup or converted from an
    old db schema through a means which does not preserve the guid.

The following things could go wrong in this failure mode:

  * A working copy could erroneously refuse to operate, thinking its
    base contents are invalid.

  * An "svn switch --relocate" could fail when it ought to succeed.

  * Copy history could be treated as invalid even though it is valid.

  * Deltas might be repeatedly applied during a merge when they
    shouldn't be.

  * A repository might misfile or lose contents pulled from another
    repository.

IMPLEMENTATION CHOICES
----------------------

1. How do we store the guid?

We could store it in a rev-0 property and access it through the
rev-prop ra layer commands; or we could store it elsewhere (in a BDB
table or just a file) and invent new ways to access it.
Considerations:

  * Properties are a convenient end-to-end path between the FS and the
    client code.
    Using them means the interior layers (libsvn_client, libsvn_ra_*,
    libsvn_repos) don't have to become more complicated, but the
    exterior layers (libsvn_wc, libsvn_fs_*) grow in a somewhat less
    natural way.

    A good way of exploring this issue is to imagine that Unix inode
    fields were implemented using properties instead of fixed fields.
    The system call interface wouldn't have had to change when new
    information was added (like the "immutable" and "compressed"
    attributes in ext2fs), but the kernel fs code for getting and
    setting an inode property would be kind of gross, as would the
    exception-handling discipline in userland. (Did I get an EPERM
    because I can't set any properties on this inode, or because I
    can't set that particular property to that particular value?)

  * If rev-props acquire checksums, then we'll get a checksum on the
    guid for free if we use a rev-prop; that checksum would make it
    easier to diagnose a corrupted guid. (But I'm not convinced anyone
    will ever experience a corrupted guid.)

  * We should try to be architecturally consistent. If we don't use
    rev-props for the guid because we don't want to complicate the
    ends of the properties pipe, then did we make the right choice to
    use rev-props for commit information?

2. How hard do we make it to change the guid?

A guid needs to be changed to a known value if a repository was copied
in a manner not preserving the guid, and the new repository is
expected to be kept in sync with the old one through out-of-band
means. Also if the repository was restored from backup or converted
from an old db schema in a manner not preserving the guid.

A guid needs to be changed to a newly-invented value if a repository
was copied in a manner preserving the guid, and the new repository is
expected to diverge from the old one.

Bad things will happen if someone naively changes the guid, but this
may be an unlikely user error, especially if the template rev-prop
hook script disallows it.
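As a rough illustration of a hook script that disallows the change, assuming the guid were stored as a rev-0 property: the property name `svn:guid` is invented for this sketch (the mail does not name it), and the argument order REPOS REV USER PROPNAME and the "non-zero exit rejects the change" convention are assumptions to check against the hook template actually shipped.

```python
import sys

# Hypothetical property name; the mail does not say what the guid
# rev-prop would be called.
GUID_PROPNAME = "svn:guid"


def allow_revprop_change(rev: int, propname: str) -> bool:
    """Return False for attempts to change the guid stored on revision 0."""
    if rev == 0 and propname == GUID_PROPNAME:
        return False
    return True


def main(argv: list) -> int:
    # Assumed argument order for this sketch: REPOS REV USER PROPNAME.
    _repos, rev, _user, propname = argv[1:5]
    if allow_revprop_change(int(rev), propname):
        return 0
    print("changing the repository guid is not allowed", file=sys.stderr)
    return 1  # a non-zero exit is assumed to reject the change


# Example invocations with made-up arguments.
guid_result = main(["hook", "/repos", "0", "alice", GUID_PROPNAME])
log_result = main(["hook", "/repos", "7", "alice", "svn:log"])
print(guid_result, log_result)
```

An administrator who really does need one of the two "change the guid" operations would have to disable or bypass such a hook deliberately, which is the point: the change stops being something that happens by accident.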
If we make it hard to copy the repository without preserving the guid
(as would be the case if it's a rev-0 property), then people should
only need the "change to newly-invented value" operation under normal
circumstances. Of course, if they do that by accident, suddenly they
want the "oops, change it back, please" operation.

3. Should dump/load preserve the guid?

Unfortunately, we don't know whether a load is being done to restore
the repository from backup, or to convert it from an old db schema, or
to keep it in sync with another repository, or to create a new and
divergent repository with similar contents to the old one. But I think
the default should be to preserve the guid. Creating a divergent
repository with dump/load doesn't seem like a very common case.

Received on Mon Dec 16 06:22:17 2002
This is an archived mail posted to the Subversion Dev mailing list.