%% kfogel@collab.net writes:
k> The other thing I've been thinking about is planning ahead for storing
k> large files outside the repository db tables... or, perhaps, storing
k> all "real" data outside the tables? That is, perhaps we should
k> reserve db tables for metadata, and store file contents (and prop
k> lists? and dir entries lists?) out in the fs, using the usual sort of
k> directory/file hash scheme that one works up for such circumstances.
It might be somewhat illuminating to examine what other tools do here.
For example, in ClearCase the VOB (repository) storage is broken up into
four different pieces, called "pools":
db - The metadata database pool: labels, attributes, comment strings,
etc. There are actually a couple of discrete databases stored
here.
My understanding is that string data itself is not actually stored
in the DB (they use a Raima RDBMS internally), but rather is kept
externally in an indexed, dedicated "strings database". Those
comment strings, for example, can get very large and really it
isn't necessary to have them in the same DB as the other metadata.
In ClearCase many groups were running out of space in these
databases before 64-bit file sizes came out and were supported.
Actually the strings database contains all the names of all the
metadata: for example the names of labels and branches are there
as well. Internally to the database every object is given an OID
and the OID is the only thing referenced in the _real_ metadata
DB, and used in those operations. These OIDs are visible to the
user (you can use a special command to retrieve them) and that can
be _really_ useful sometimes. Suppose I want to send you a
reference to a specific version of a specific element. That is
very hard in a tool like ClearCase (and Subversion) due to
directory versioning: if I just give you a filename and a version,
you very well might not be able to see that element in your
workspace. In fact, there might be an element with that same name
visible in your workspace, and it's a _different element_! Instead,
I can retrieve the OID for that specific version and give that to
you. Then, in your workspace, you can use a command to translate
that back into a version reference that works for you, showing the
version information for each directory that is needed to find the
element, for example.
It's only when labels, branches, etc. need to be shown to the user
that the strings database is queried to get the user-friendly
name for a given OID.
s - The source pool: this contains the actual source. In ClearCase
you can have multiple different storage container types: whole
file containers (usually for binary files), text delta containers
(similar to traditional SCM tools where the diffs are stored),
etc.
c - A "cleartext" pool: this is basically a cache. When a request
comes into the server for a particular version of a particular
element, the cache is consulted and if it exists there, it is
simply returned. If it doesn't, it is reconstructed from the
source pool and placed in the cache, then returned.
ClearCase relies extensively on NFS (it's not a remote development
tool by any stretch) so actually what's returned by the server is
not the contents of the file, but rather just a pathname of the
file in the cleartext cache. All clients are expected to have
the pool directories available via NFS, so they can just access
the file directly (if, for example, you put your cleartext pools
on an NFS fileserver this can greatly reduce the load on your VOB
server as the clients all get it directly from the NFS box).
d - The derived object pool; this is for ClearCase's special build
avoidance version of make; it's of little interest to Subversion I
expect :).
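To make the cleartext pool idea a bit more concrete, here is a rough
Python sketch of the lookup described above: consult the cache, rebuild
from the source pool on a miss, and hand back a pathname rather than the
file contents. Every name in it (CleartextCache, path_for, the
reconstruct callback, the flat cache directory) is made up for
illustration; it is not ClearCase's interface, just one way the
behaviour could look.

```python
import os
import tempfile


class CleartextCache:
    """Toy cleartext cache: one file per version id in a cache directory."""

    def __init__(self, cache_dir, reconstruct):
        # reconstruct(version_id) -> bytes is a hypothetical stand-in for
        # rebuilding a version from the source pool (e.g. applying deltas).
        self.cache_dir = cache_dir
        self.reconstruct = reconstruct
        os.makedirs(cache_dir, exist_ok=True)

    def path_for(self, version_id):
        """Return the pathname of the cleartext file, filling the cache on a miss."""
        path = os.path.join(self.cache_dir, version_id)
        if not os.path.exists(path):
            data = self.reconstruct(version_id)
            # Write to a temporary file and rename it into place, so a client
            # reading the pool directly never sees a half-written file.
            fd, tmp = tempfile.mkstemp(dir=self.cache_dir)
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            os.replace(tmp, path)
        return path
```

The point of returning a path is the one made above: clients that can
see the pool directory read the file themselves, so the server only
does work on a cache miss.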
Within the source and cleartext (and DO) pools, files are stored as
you've suggested, using a directory/file hashing scheme. So, a
cleartext filename might be something like:
c/cdft/1/10/d67d0ee8c55111d3a1020001809321e8
The contents of this would be an actual .c file or whatever
reconstructed from the repository. I don't have the first idea how this
hashing works, and it's probably not important anyway.
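Since I can't describe the real scheme, here is a generic sketch of the
usual sort of directory/file fan-out, assuming objects are named by a
hex identifier and the leading characters of that identifier pick the
subdirectories. The function name hashed_path and its levels/width
parameters are hypothetical, and this is explicitly not how ClearCase
derives the "1/10" components in the path above.

```python
import posixpath


def hashed_path(pool_root, object_id, levels=2, width=2):
    """Spread objects across subdirectories keyed by a prefix of their id.

    This is an assumed, generic fan-out layout, not ClearCase's scheme.
    """
    if len(object_id) <= levels * width:
        raise ValueError("object id too short for requested fan-out")
    # Take `levels` consecutive chunks of `width` characters as directories.
    parts = [object_id[i * width:(i + 1) * width] for i in range(levels)]
    return posixpath.join(pool_root, *parts, object_id)
```

With the defaults, hashed_path("c/cdft", "d67d0ee8c55111d3a1020001809321e8")
gives "c/cdft/d6/7d/d67d0ee8c55111d3a1020001809321e8", which keeps any
single directory from collecting an enormous number of entries.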
The "cdft" is slightly interesting: you can actually create multiple
instances of each of the pools above (except the db pool of course), and
you can use simple UNIX symlinks, etc. to spread them out across
multiple systems or whatever if you like.
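As a rough illustration of the symlink trick, the following Python
sketch moves one pool instance to another location and leaves a symlink
at the old path, so anything that walks the original directory tree
still finds the files. The function name relocate_pool and the paths
are invented for the example; ClearCase has its own administrative
commands for this, which I am not trying to reproduce here.

```python
import os
import shutil


def relocate_pool(pool_path, new_location):
    """Move a pool directory and leave a symlink behind at the old path."""
    new_location = os.path.abspath(new_location)
    os.makedirs(os.path.dirname(new_location), exist_ok=True)
    # shutil.move falls back to copy-and-delete across filesystems.
    shutil.move(pool_path, new_location)
    # The old path now points at the relocated directory.
    os.symlink(new_location, pool_path, target_is_directory=True)
```

For example, relocate_pool("c/cdft", "/mnt/disk2/cdft") would put the
"cdft" cleartext pool on a second disk (a hypothetical mount point)
while "c/cdft" keeps working as before.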
Much of this is probably not what you want to be thinking about at this
stage of Subversion development, but perhaps it's interesting on some
level, if only to consider things you might need/want to do in the
future.
--
-------------------------------------------------------------------------------
Paul D. Smith <psmith@baynetworks.com> HASMAT--HA Software Methods & Tools
"Please remain calm...I may be mad, but I am a professional." --Mad Scientist
-------------------------------------------------------------------------------
These are my opinions---Nortel Networks takes no responsibility for them.
Received on Sat Oct 21 14:36:33 2006