
Re: [RFC] issue 2286 - space saving in the repository

From: Max Bowsher <maxb1_at_ukf.net>
Date: 2006-05-20 16:47:07 CEST

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ph. Marek wrote:
> Long-lived branches with relatively sparse changes, which are often updated
> from the trunk, will have most of their files identical to the trunk.
>
> That wastes storage space, as the branch has only the originally "copied"
> nodes shared, but each merge produces a new file, which (although
> possibly identical to another file on trunk) needs the storage space for
> its delta.

Correct. A corollary is that when diffing a branch against trunk, lots
of unnecessary comparisons are being made for files which are the same
on both branches.

> How can this be solved? Any solution from the client side is messy.
...
> the solution must be based in the repository itself.

Yes.

> This means a repository format change, with a corresponding load/dump-cycle.

Or does it...? :-)

We already have a prior case in which additional tables were added to
existing repositories without requiring a dump/load: the introduction of
locking in 1.2.

I think it may be possible to pull the same trick here.

> From subversion/libsvn_fs_base/notes/structure:
> FILE: how files are represented.
>
> If a NODE-REVISION's header's KIND is "file", then the node-revision
> skel represents a file, and has the form:
>
> (HEADER PROP-KEY DATA-KEY [EDIT-DATA-KEY])
>
> where DATA-KEY identifies the representation for the file's current
> contents, and EDIT-DATA-KEY identifies the representation currently
> available for receiving new contents for the file.
>
> To my knowledge the DATA-KEY is always numeric, or at least base-36.
> So a simple way to allow hardlinks in the repository would be to
> prepend a string, e.g. "hl-" (which can never occur in normal data-keys),
> to distinctly detect a hardlink.
> (Or a field is inserted to tell whether it's a normal DATA-KEY or a hardlink.)

Hang on a moment: in real disk filesystems, once a hardlink has been
made, there is _no_ evidence which was the original filename and which
was the linked filename. I see no reason why the same should not be
true for Subversion. In both BDB and FSFS repositories, the history
storage and content storage are very decoupled, with nodes referring to
their content using only 1 (BDB) or a few (FSFS) numbers - numbers which
could likely simply be duplicated when storing a new node-revision
containing the same content as a previous one.
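
A minimal sketch of that idea, in Python rather than Subversion's C, and
not taken from the real libsvn_fs code: the NodeRevision class, the
data_key field and the representations table are hypothetical stand-ins
for a node-revision, its DATA-KEY and the content storage. Two
node-revisions carry the same key, and nothing marks either one as the
"original".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRevision:
    """Hypothetical, simplified stand-in for a file node-revision."""
    path: str
    revision: int
    data_key: str  # plays the role of DATA-KEY; no "hardlink" marker needed


# Hypothetical content storage: one representation, keyed by data key.
representations = {"1a": b"int main(void) { return 0; }\n"}

trunk_file = NodeRevision("/trunk/main.c", 10, "1a")
# The branch node-revision simply duplicates the key instead of storing
# a new representation for identical content.
branch_file = NodeRevision("/branches/stable/main.c", 12, trunk_file.data_key)


def read_contents(node: NodeRevision) -> bytes:
    """Fetch a node-revision's contents through its data key."""
    return representations[node.data_key]


def same_content(a: NodeRevision, b: NodeRevision) -> bool:
    """Equal keys imply equal content, so no byte comparison is needed."""
    return a.data_key == b.data_key
```

With shared keys, a branch-versus-trunk diff could in principle skip any
file for which same_content() is true.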

> Furthermore it is necessary to get a new index, going from the MD5-checksum
> of a file to the representation.

Yes. A new table (BDB), or some structure similar to the way locks are
currently stored (FSFS) should do the trick.
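
As a rough illustration only, such an index could behave like the
following Python sketch. RepresentationStore and its methods are
hypothetical names, not an existing Subversion API, and the in-memory
dictionaries stand in for whatever BDB table or FSFS structure would
actually be used.

```python
import hashlib


class RepresentationStore:
    """Hypothetical content store with an MD5 -> representation-key index."""

    def __init__(self):
        self._reps = {}     # representation key -> content
        self._by_md5 = {}   # MD5 hex digest -> representation key
        self._next_id = 0

    def store(self, content: bytes) -> str:
        """Return the key for content, reusing an existing representation."""
        digest = hashlib.md5(content).hexdigest()
        key = self._by_md5.get(digest)
        # Compare the bytes as well, so an MD5 collision never aliases
        # two different files.
        if key is not None and self._reps[key] == content:
            return key
        new_key = str(self._next_id)
        self._next_id += 1
        self._reps[new_key] = content
        # Keep the first key seen for a digest; a colliding file still
        # gets its own representation, it just is not indexed.
        self._by_md5.setdefault(digest, new_key)
        return new_key

    def count(self) -> int:
        """Number of distinct representations actually stored."""
        return len(self._reps)
```

Storing the same bytes twice (say, once on trunk and once after a merge
to a branch) would then hand back the same key both times, so only one
representation occupies space.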

> | Changes on the client-side
> +----------------------------
>
> As this transformation is completely hidden within the repository code,
> clients need not see any change. Not in the API, not in the behaviour.
>
> But it might be very useful to allow clients to say "make a hardlink
> to that data now", because if a client sees a hardlink in its working copy
> it could say "make a new node, with these contents" and possibly avoid
> spooling many MB over the wire.

Possible, I suppose, but that is a completely separate project, one that
could be discussed and worked on without reference to the repository
changes proposed above.

> The traditional way is to make a new API, which includes more parameters
> than the old, and supersedes this.
> This requires changes not only in the client (which would be necessary,
> to let the client detect a possibility for hardlinking), but in the
> repository too, and in the RA layers in between.
>
> That's the clean way, but with a lot of work.
>
>
> Another variant is as follows:
> The current svn_revnum_t is a "long int", which is 32bit or 64bit, depending
> on platform.
> The most active project I know is KDE. Its repository now has several
> hundred thousand revisions, and gets about ten thousand new ones each month.
>
> But assuming 10 000 commits each month, by the time the 32bit unix time_t
> becomes invalid (in 2038, and with it most software), the KDE project will
> have accumulated a mere 3 840 000 more commits.
>
> So a 32bit revision number won't even be needed for them; a 22bit number could
> hold their 4 million revisions.
>
> I therefore suggest using the high bits of the copy_revision parameter
> in delta editors' add_file/add_directory-calls for some flags.

The Subversion project prides itself on doing things right, and coming
up with sensible, *maintainable* designs. There is NO WAY something like
the above disgusting hack would be met with anything but scorn and vetoes.

Max.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.1 (Cygwin)

iD8DBQFEbyvrfFNSmcDyxYARAtMgAJ4v26G2LAI29coFbo1CRMERj5jV7gCgig2p
Ua7fp4nOIR0wSDvfku3X494=
=06g3
-----END PGP SIGNATURE-----

Received on Sat May 20 16:47:38 2006

This is an archived mail posted to the Subversion Dev mailing list.
