
Re: Tags and scalability (was Re: Enlightenment)

From: Ben Collins-Sussman <sussman_at_red-bean.com>
Date: 2002-09-25 03:18:19 CEST

Kean Johnston wrote:

> Although it may be a "relatively cheap" operation to copy
> a directory, please consider the effect when subversion
> is asked to maintain very large trees. Let's say there are
> a quarter of a million files. That means at the very least,
> assuming a single 4-byte integer is used for each file
> as its "pointer", 1 megabyte (give or take a teeny bit)
> per tag. If you want to make weekly, intra-weekly or
> possibly even daily tags, this can get very expensive
> very quickly.

Kean, I fear you don't understand how Subversion's "cheap copies" work
at all. You need to read our Design document, in the file-sharing area
of our website.

To make a branch or tag in the repository, we create ONE new directory
entry somewhere. That's it! The directory entry points to an already
existing inode, which in turn holds a whole tree of inodes. Think of a
cheap copy as being like a hard link in Unix. Just a new pointer to a
tree that already exists. The size of the tree is irrelevant. That's
why it takes O(1) (constant) time to make a tag. Compare that to CVS,
where you have to put new tag info into *every* single RCS file, which
is O(N).
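
As a rough illustration of that difference, here is a toy Python model.
It is only a sketch, not Subversion's actual filesystem code: `Node`,
`cheap_copy` and `cvs_style_tag` are made-up names, and the "RCS files"
are plain dictionaries. The point is that the cheap copy adds one
directory entry that shares the existing subtree, while CVS-style
tagging has to visit every file.

```python
class Node:
    """Toy tree node: a file (content) or a directory (children)."""

    def __init__(self, content=None, children=None):
        self.content = content
        self.children = dict(children or {})


def cheap_copy(root, src, dst):
    """Return a new root with ONE extra entry, dst, pointing at the
    already existing node stored under src.

    No node below src is visited or duplicated, so the cost does not
    depend on how many files the copied tree holds. (Only the parent's
    own entry table is copied here, to keep the old root unchanged.)
    """
    entries = dict(root.children)
    entries[dst] = root.children[src]  # share the subtree, like a hard link
    return Node(children=entries)


def cvs_style_tag(rcs_files, tag):
    """Record tag in every file's own metadata: O(N) in the file count.

    rcs_files is a hypothetical mapping of file name to a dict with a
    "head" revision; returns how many files had to be touched.
    """
    touched = 0
    for info in rcs_files.values():
        info.setdefault("tags", {})[tag] = info["head"]
        touched += 1
    return touched
```

With a `trunk` of a quarter of a million file nodes, `cheap_copy(root,
"trunk", "tags-1.0")` still adds a single entry, whereas
`cvs_style_tag` would return 250000.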

>
> How about this. Since there is always just a single version
> that the tree is at at any given time (let's say when I
> make the tag it's at version 3261). If I was to use the
> yet-to-be-written "svn tag my_release_tag_name", it
> could use a single database record in an SVN specific file
> at the root of the tree that simply records the current
> tree version. Thus if I ever check out my_release_tag_name
> it knows I really mean release 3261. This then limits the
> data required to store the tag to 4 bytes for the revision
> number and however many bytes the symbolic tag name is.
> This also *HAS* to be a quicker operation than directory
> copying, no matter how fast a directory copy is.
>

You're describing the exact same idea here -- the "cheap copy" -- just
reinvented in a slightly different way. :-)
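
To make the comparison concrete, here is a minimal Python sketch of the
scheme Kean describes: a table mapping a symbolic name to the revision
number it stands for. The names `make_tag`, `resolve` and `record_size`
are hypothetical, and the 4-byte revision width is Kean's assumption,
not a statement about Subversion's storage format. Either way, the tag
is one small record whose size does not depend on the size of the tree.

```python
def make_tag(tags, name, revision):
    """Record that name refers to the whole tree as of revision."""
    if name in tags:
        raise ValueError("tag already exists: " + name)
    tags[name] = revision


def resolve(tags, name):
    """Translate a symbolic tag name back into a revision number."""
    return tags[name]


def record_size(name, revision_bytes=4):
    """Bytes needed for one tag record: the revision number plus the
    bytes of the symbolic name (assumed 4-byte revision number)."""
    return revision_bytes + len(name.encode("utf-8"))
```

For example, after `make_tag(tags, "my_release_tag_name", 3261)`, a
checkout of `my_release_tag_name` would resolve to revision 3261.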

> My other concern is with the "hidden cached copies of
> every file" scheme. For something the size of Apache,
> and subversion, maybe even something meatier like X11,
> that may be OK, but when your source tree is over 3G
> in size, you now double that to 6G. That's a huge hit.
> Can we at least open up a discussion about possibly
> rethinking why the cached copy is needed? Is it THAT
> important that you can revert a file on an aeroplane?
> Wouldn't keeping a simple CRC or even MD5 hash of the
> file to be able to *detect* changes suffice? Or at
> least give the svn repository manager the option of
> setting up his repository that way. Of course the
> problem becomes bigger when someone in the military
> decides to use subversion one day (ain't that a pun?)
> to manage their 40G ADA repositories.

This discussion has happened many times. I think we have plans to make
the cached copies optional someday, after we release 1.0. Greg Hudson
has plans for this... see issue #525. Greg? Details?

Really, our rationale (two years ago) was that disk space gets bigger and
cheaper FAR faster than network bandwidth does. So when given a choice, we
chose to optimize for the network. Having cached copies is nice from a
network standpoint: you can view and revert your changes without the
network, and the client can send small diffs during commits.
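
The trade-off can be sketched in a few lines of Python. This is an
illustration only, not how the Subversion client is implemented;
`is_modified`, `revert` and `local_diff` are hypothetical helpers. A
stored hash is enough to *detect* a change, but reverting a file or
producing a diff offline needs the cached pristine text itself.

```python
import difflib
import hashlib


def is_modified(working, recorded_md5):
    """Detect a local change using only a stored MD5 hex digest.
    working is the current file content as bytes."""
    return hashlib.md5(working).hexdigest() != recorded_md5


def revert(pristine):
    """Reverting without the network needs the pristine bytes; a hash
    alone cannot give the original content back."""
    return pristine


def local_diff(pristine_text, working_text):
    """Build a unified diff locally from the cached pristine text, the
    kind of small delta a client could send instead of the whole file."""
    return list(difflib.unified_diff(
        pristine_text.splitlines(),
        working_text.splitlines(),
        "pristine",
        "working",
        lineterm="",
    ))
```

With only the hash, `is_modified` still works, but `revert` and
`local_diff` would have to fetch the original text from the server.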

Received on Wed Sep 25 03:21:10 2002

This is an archived mail posted to the Subversion Dev mailing list.
