Re: RFC: Delta indexing and composition

From: Paul D. Smith <pausmith_at_nortelnetworks.com>
Date: 2001-07-30 20:33:33 CEST

%% kfogel@collab.net writes:

  k> The other thing I've been thinking about is planning ahead for storing
  k> large files outside the repository db tables... or, perhaps, storing
  k> all "real" data outside the tables? That is, perhaps we should
  k> reserve db tables for metadata, and store file contents (and prop
  k> lists? and dir entries lists?) out in the fs, using the usual sort of
  k> directory/file hash scheme that one works up for such circumstances.

It might be somewhat illuminating to examine what other tools do here.

For example, in ClearCase the VOB (repository) storage is broken up into
four different pieces, called "pools":

db - The metadata database pool: labels, attributes, comment strings,
      etc. There are actually a couple of discrete databases stored
      here.

      My understanding is that string data itself is not actually stored
      in the DB (they use a Raima RDBMS internally), but rather is kept
      externally in an indexed, dedicated "strings database". Those
      comment strings, for example, can get very large and really it
      isn't necessary to have them in the same DB as the other metadata.
      In ClearCase many groups were running out of space in these
      databases before 64-bit file sizes came out and were supported.

      Actually the strings database contains all the names of all the
      metadata: for example the names of labels and branches are there
      as well. Internally to the database every object is given an OID
      and the OID is the only thing referenced in the _real_ metadata
      DB, and used in those operations. These OIDs are visible to the
      user (you can use a special command to retrieve them) and that can
      be _really_ useful sometimes: consider that I want to send you a
      reference to a specific version of a specific element. That is
      very hard in a tool like ClearCase (and Subversion) due to
      directory versioning: if I just give you a filename and a version
      you very well might not be able to see that element in your
      workspace. In fact, there might be an element with that same name
      visible in your workspace, and it's a _different element_! Instead, I
      can retrieve the OID for that specific version, and give that to
      you. Then, in your workspace, you can use a command to translate
      that back into a version reference that works for you, showing the
      version information for each directory that is needed to find the
      element, for example.

      It's only when labels, branches, etc. need to be shown to the user
      that the strings database is queried to get the user-friendly
      name for a given OID.

s - The source pool: this contains the actual source. In ClearCase
      you can have multiple different storage container types: whole
      file containers (usually for binary files), text delta containers
      (similar to traditional SCM tools where the diffs are stored),
      etc.

c - A "cleartext" pool: this is basically a cache. When a request
      comes into the server for a particular version of a particular
      element, the cache is consulted and if that version exists there,
      it is simply returned. If it doesn't, it is reconstructed from the
      source pool and placed in the cache, then returned.

      ClearCase relies extensively on NFS (it's not a remote development
      tool by any stretch) so actually what's returned by the server is
      not the contents of the file, but rather just a pathname of the
      file in the cleartext cache. All clients are expected to have
      the pool directories available via NFS, so they can just access
      the file directly (if, for example, you put your cleartext pools
      on an NFS fileserver this can greatly reduce the load on your VOB
      server as the clients all get it directly from the NFS box).

d - The derived object pool; this is for ClearCase's special build
      avoidance version of make; it's of little interest to Subversion I
      expect :).
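
To make the cleartext pool behavior a bit more concrete, here is a
rough Python sketch of the read-through idea: look in the cache, and on
a miss rebuild the file from the source pool, store it, and hand back
its pathname rather than its contents. This is not ClearCase code. The
names CleartextCache, path_for and reconstruct are made up for
illustration, and the flat key-as-filename layout is an assumption.

```python
import os
import tempfile


class CleartextCache:
    """Read-through cache that returns a pathname, not file contents."""

    def __init__(self, cache_dir, reconstruct):
        self.cache_dir = cache_dir
        # Hypothetical callback: key -> bytes, rebuilt from the source pool
        # (for example by applying stored deltas).
        self.reconstruct = reconstruct

    def path_for(self, key):
        path = os.path.join(self.cache_dir, key)
        if not os.path.exists(path):
            # Cache miss: rebuild the cleartext and publish it atomically,
            # so readers on a shared filesystem never see a partial file.
            data = self.reconstruct(key)
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(data)
            os.replace(tmp, path)
        # Hit or freshly filled: the caller opens the file itself.
        return path
```

Returning a pathname only works because every client can reach the
pool directory, which is the NFS assumption described above. A tool
meant for remote use would have to return the bytes instead.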

Within the source and cleartext (and DO) pools, files are stored as
you've suggested, using a directory/file hashing scheme. So, a
cleartext filename might be something like:

c/cdft/1/10/d67d0ee8c55111d3a1020001809321e8

The contents of this would be an actual .c file or whatever
reconstructed from the repository. I don't have the first idea how this
hashing works, and it's probably not important anyway.
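
For the general flavor of such a scheme, here is a minimal Python
sketch of one common approach: use the leading characters of an object
id as fan-out directories so that no single directory gets too large.
To be clear, this is an assumed scheme for illustration and not how
ClearCase derives the "1/10" components in the path above; pool_path
and its fanout parameter are hypothetical names.

```python
import posixpath


def pool_path(pool, instance, object_id, fanout=(1, 2)):
    """Map an object id to pool/instance/<dir>/<dir>/<object_id>.

    Each entry in `fanout` is the number of leading id characters
    consumed for one directory level (an assumption, not ClearCase's rule).
    """
    parts = []
    pos = 0
    for width in fanout:
        parts.append(object_id[pos:pos + width])
        pos += width
    return posixpath.join(pool, instance, *parts, object_id)
```

With the id from the example, pool_path("c", "cdft",
"d67d0ee8c55111d3a1020001809321e8") gives
c/cdft/d/67/d67d0ee8c55111d3a1020001809321e8.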

The "cdft" is slightly interesting: you can actually create multiple
instances of each of the pools above (except the db pool of course), and
you can use simple UNIX symlinks, etc. to spread them out across
multiple systems or whatever if you like.

Much of this is probably not what you want to be thinking about at this
stage of Subversion development, but perhaps it's interesting on some
level, if only to consider things you might need/want to do in the
future.

-- 
-------------------------------------------------------------------------------
 Paul D. Smith <psmith@baynetworks.com>    HASMAT--HA Software Methods & Tools
 "Please remain calm...I may be mad, but I am a professional." --Mad Scientist
-------------------------------------------------------------------------------
   These are my opinions---Nortel Networks takes no responsibility for them.

Received on Sat Oct 21 14:36:33 2006
