
[RFC] issue 2286 - space saving in the repository

From: Ph. Marek <philipp.marek_at_bmlv.gv.at>
Date: 2006-05-15 09:04:14 CEST

Hello everybody,

please see my RFC for issue 2286 below.

Discussion, ideas, questions, and criticism are welcome.

Regards,

Phil

| What is this about
+----------------------------

The things described here apply to issue 2286: a kind of shared storage
for identical files in the repository is needed, without any way to infer
such sharing from the outside.
(That is, each of the identical files behaves like a normal entry; only
the storage space in the database is shared.)

The corresponding link to the issue is
    http://subversion.tigris.org/issues/show_bug.cgi?id=2286

Below, the term "hardlink" is used for a file content pointer in the
repository which doesn't affect history. It's also still true that
older contents in the repository can never change; therefore a "hardlink"
to an already committed file makes the new entry an exact duplicate, but
doesn't allow modification of the old data.
(That's like the "ln" command in unix, which also doesn't allow you to say
"make a link, but change line 5".)

 
| Why
+----------------------------

In source code versioning, branches are used freely, especially in
subversion, where creating a branch is an O(1) operation.

Long-lived branches with relatively sparse changes that are often updated
from the trunk will have most of their files identical to the trunk.

That wastes storage space: the branch shares only the originally "copied"
nodes, but each merge produces a new file, which (although possibly
identical to a file on the trunk) needs storage space for its delta.

 (I currently don't have numbers, as I don't have a good example
  repository with lots of branches and merges.
  The repositories of subversion, kde or gcc might be good candidates -
  but the repository should be available locally, as doing such an analysis
  might be prohibitive in terms of network bandwidth.
  I have to admit that I have no script for this analysis either; but see [4]
  for an algorithm.)

That's especially true with big working copies.

In fsvs (fsvs.tigris.org) we're versioning whole machine installations
(each working copy several GB), and if there are several such machines with
similar update strategies (e.g. all on Debian testing), the wasted space
adds up.

 
| How
+----------------------------

How can this be solved? Any solution from the client side is messy.
- Looking for an identical file in the repository and committing a copy of
  it messes up the history:
  If machines A and B each have a /bin/login, and both update it 5 times,
  then each /bin/login should have its predecessor set *in the same
  working copy/commit path for this machine*, not a copy from
  ../machine-B/bin/login or some such.
  This would completely defeat history tracking.
  The same is true if there's a central place in the repository for the
  "original" file data and all committed files are just copies from there.
- Doing some tricks with properties (i.e. storing in a property the
  repository path and revision of the full-text) would solve the space
  problem, but would make that repository incompatible with all other
  subversion clients and repository browsers.
  History would not be preserved either.

Following this reasoning (and assuming that no better ideas come up),
the solution must be implemented in the repository itself.
This means a repository format change, with a corresponding dump/load cycle.
(But see [1] for some more discussion, and [3] for another note.)

From subversion/libsvn_fs_base/notes/structure:
    FILE: how files are represented.
    
    If a NODE-REVISION's header's KIND is "file", then the node-revision
    skel represents a file, and has the form:
    
        (HEADER PROP-KEY DATA-KEY [EDIT-DATA-KEY])
    
    where DATA-KEY identifies the representation for the file's current
    contents, and EDIT-DATA-KEY identifies the representation currently
    available for receiving new contents for the file.

To my knowledge the DATA-KEY is always numeric, or at least base-36.
So a simple way to allow hardlinks in the repository would be to prepend
a string, e.g. "hl-" (which can never occur in normal data-keys), so that
a hardlink can be detected unambiguously.
(Alternatively, a field could be inserted to tell whether it's a normal
DATA-KEY or a hardlink.)

Such a link would simply be followed; hardlinks to hardlinks are not allowed.
Performance-wise this merely adds an O(1) operation to data lookup.
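
A minimal sketch of what following such a key could look like; the helper
name resolve_data_key and its surroundings are purely illustrative, not
existing functions in libsvn_fs_base:

    #include <assert.h>
    #include <string.h>

    /* Hypothetical sketch: resolve a DATA-KEY that may be a hardlink.
     * An "hl-" prefix marks a key that points at another representation's
     * key; only one level of indirection is allowed. */
    static const char *
    resolve_data_key(const char *data_key)
    {
      if (strncmp(data_key, "hl-", 3) == 0)
        {
          /* Follow the link exactly once: the remainder is the key of the
           * representation holding the real file contents. */
          const char *target = data_key + 3;

          /* Hardlinks to hardlinks are not allowed. */
          assert(strncmp(target, "hl-", 3) != 0);
          return target;
        }

      /* A normal DATA-KEY is used as-is. */
      return data_key;
    }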

As far as I understand the database schema, we don't need to use the Copies
table for hardlinks.

Furthermore, a new index is needed that maps the MD5 checksum of a file to
its representation.
A representation already includes the MD5 checksum; we just need a fast way
to find possibly identical files (which can then be hardlinked).

I propose creating a new table named "checksums" whose keys are the
checksums and whose values are keys of the representations table.
(Whether MD5 or SHA-1 or something else is used makes no real difference;
we'll have to prepare for a *list* of matching representations anyway.)
(See also [3].)
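
On the BDB level such a table could be kept quite simple; here is a sketch
assuming duplicate keys (DB_DUPSORT) are used to hold the *list* of matching
representation keys. The table name, the function names and the use of raw
MD5 digests as keys are illustrative assumptions only:

    #include <string.h>
    #include <db.h>

    /* Hypothetical "checksums" table: MD5 digest -> representation key(s).
     * DB_DUPSORT lets one checksum map to several candidate representations,
     * since a digest match is only a hint and the contents may still have
     * to be compared. */
    static int
    open_checksums_table(DB_ENV *env, DB **table_p)
    {
      DB *table;
      int err = db_create(&table, env, 0);
      if (err)
        return err;

      table->set_flags(table, DB_DUPSORT);  /* allow a list of values per key */

      err = table->open(table, NULL, "checksums", NULL,
                        DB_BTREE, DB_CREATE, 0666);
      if (err)
        return err;

      *table_p = table;
      return 0;
    }

    /* Record that REP_KEY stores contents with the given MD5 digest. */
    static int
    remember_checksum(DB *table, const unsigned char md5[16],
                      const char *rep_key)
    {
      DBT key, value;
      memset(&key, 0, sizeof(key));
      memset(&value, 0, sizeof(value));
      key.data = (void *)md5;
      key.size = 16;
      value.data = (void *)rep_key;
      value.size = (u_int32_t)strlen(rep_key) + 1;
      return table->put(table, NULL, &key, &value, 0);
    }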

 
| Changes on the client-side
+----------------------------

As this transformation is completely hidden within the repository code,
clients need not see any change. Not in the API, not in the behaviour.

But it might be very useful to allow clients to say "now make a hardlink
to that data": a client that sees such duplicate data in its working copy
could say "make a new node with these contents" and possibly avoid
spooling many MB over the wire.

The traditional way is to create a new API which takes more parameters
than the old one and supersedes it.
This requires changes not only in the client (which would be necessary
anyway, to let the client detect an opportunity for hardlinking), but also
in the repository and the RA layers in between.

That's the clean way, but with a lot of work.
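
To illustrate what "more parameters" could mean, here is a purely
hypothetical extension of the delta editor's add_file entry; the name
add_file2 and the extra flag do not exist in the current API and are shown
only as an assumption of how such an extension might look:

    /* Existing add_file entry in svn_delta_editor_t (paraphrased). */
    svn_error_t *(*add_file)(const char *path,
                             void *parent_baton,
                             const char *copyfrom_path,
                             svn_revnum_t copyfrom_revision,
                             apr_pool_t *file_pool,
                             void **file_baton);

    /* Hypothetical superseding entry: the extra flag asks the repository to
     * hardlink the contents of copyfrom_path@copyfrom_revision instead of
     * recording a copy. */
    svn_error_t *(*add_file2)(const char *path,
                              void *parent_baton,
                              const char *copyfrom_path,
                              svn_revnum_t copyfrom_revision,
                              svn_boolean_t hardlink,
                              apr_pool_t *file_pool,
                              void **file_baton);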

Another variant is as follows:
The current svn_revnum_t is a "long int", which is 32 or 64 bit, depending
on the platform.
The most active project I know is KDE. Its repository now has several
hundred thousand revisions, and gets about ten thousand new ones each month.

But assuming 10 000 commits each month, by the time the 32-bit unix time_t
overflows (in 2038, taking most software with it), the KDE project will have
accumulated a mere 3 840 000 additional commits (32 years * 12 months *
10 000 commits).

So a full 32-bit revision number won't even be needed for them; a 22-bit
number could hold their roughly 4 million revisions.

I therefore suggest using the high bits of the copy_revision parameter
in the delta editor's add_file/add_directory calls for some flags.

For hardlinking we'd need only a single bit, saying "Use the given
path at the given revision (after stripping the flags off), and make a
hardlink to that instead of a branch/copy".

To avoid making revision numbers negative we could use the second-highest
bit; then code testing for revnum < 0 (to check against SVN_INVALID_REVNUM,
which is -1) would still get its test right.
(This shows that the usage of -1 as a flag value potentially invalidates a
full 2G of revision numbers, in subversion and in 3rd-party programs!)
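
As a sketch, flagging the copy_revision could look like this in C; the flag
name, its exact position and the helper functions are illustrative
assumptions, not proposed code:

    #include "svn_types.h"   /* svn_revnum_t, svn_boolean_t */

    /* Hypothetical flag in the second-highest bit, so the value stays
     * non-negative and tests like "revnum < 0" keep working. */
    #define REVNUM_FLAG_HARDLINK \
      ((svn_revnum_t)1 << (sizeof(svn_revnum_t) * 8 - 2))

    /* Sender: ask for a hardlink to path@rev instead of a copy. */
    static svn_revnum_t
    encode_hardlink_rev(svn_revnum_t rev)
    {
      return rev | REVNUM_FLAG_HARDLINK;
    }

    /* Receiver: strip the flag off before using the revision number. */
    static svn_revnum_t
    decode_rev(svn_revnum_t wire_rev, svn_boolean_t *is_hardlink)
    {
      *is_hardlink = (wire_rev & REVNUM_FLAG_HARDLINK) != 0;
      return wire_rev & ~REVNUM_FLAG_HARDLINK;
    }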

 (If we decide to go that route, we could use another bit [2] for
  "true rename support". However, the rename branch seems to use an
  additional RA-layer function to support renames, and this issue
  might need one, too.)

I'm not saying that we have to do it the "wrong" way, just that it would
be *much* less work for the intermediate layers.

If a client requests such a hardlinked file, I'd suggest returning just the
requested data, without any indication that it's really hardlinked to some
other representation.
Of course, that means that all operations involving multiple hardlinked
files would require more network bandwidth than necessary, just like today.

If we go the changed-API route, there could be a way to tell the client
"this is the same data as XXX", but I don't think that would help, as the
client normally won't be able to fetch this data without a second session -
and even then the latency would be bad.
And telling the client where to find the hardlinked data doesn't mean that
it has access to it - so in most cases it would have to query it anyway.

 
| Addendums
+----------------------------

Ad 1:
Currently fsfs is not usable for *lots* of files committed at once (on the
order of 10 000 and above). For a commit with N changed files, the creation
of 2*N files *in a single directory* is very bad performance-wise
(at least on the tested ext3 [with directory index] and reiserfs).
BDB is much faster (by a factor of 3, AFAIR), so I'll only discuss BDB here.

In BDB there are lists of values stored as values; if such a list could take
an optional extra element, we might get around a dump/load cycle *now*.
With the next addition of a field that would no longer be true, so the
advantages are small compared to the disadvantages.
Summary: I believe we'd need a repository format change, *with* a
dump/load cycle.

(I didn't look at fsfs.)

Ad 2:
Or just reserve 3 bits, and define an enum for them - 0 normal copy,
1 hardlink, 2 rename, ...
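In C, that reserved field could simply be (names illustrative only):

    /* Hypothetical 3-bit "copy kind" field as sketched above. */
    enum copyfrom_kind {
      copyfrom_normal_copy = 0,
      copyfrom_hardlink    = 1,
      copyfrom_rename      = 2
      /* values 3..7 reserved for future use */
    };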

Ad 3:
As there are now ways to generate files with identical MD5 checksums, it
might be nice to support more checksums than just this one; APR already
knows SHA-1, so that would be trivial to support, too. (SHA-256 or some such
might be better.)
The header in representations would then be expanded to hold more than
one checksum (which is easily done in the current BDB schema, as the
CHECKSUM already consists of [type, len, value]).
The search for possibly identical files could always be done via MD5, so
that only this one additional index is needed.
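
As a sketch of how cheap the extra hashing would be, computing both digests
with what apr/apr-util already provide could look like this; only the idea
is shown, not the integration into the representation code:

    #include <apr_md5.h>
    #include <apr_sha1.h>

    /* Compute MD5 and SHA-1 of a buffer using existing APR routines. */
    static void
    digest_both(const char *data, apr_size_t len,
                unsigned char md5[APR_MD5_DIGESTSIZE],
                unsigned char sha1[APR_SHA1_DIGESTSIZE])
    {
      apr_sha1_ctx_t sha1_ctx;

      apr_md5(md5, data, len);

      apr_sha1_init(&sha1_ctx);
      apr_sha1_update(&sha1_ctx, data, (unsigned int)len);
      apr_sha1_final(sha1, &sha1_ctx);
    }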

The client should get a way to say "I'd like to get the MD5", "the SHA-1",
or both - the buffer in the callback for update_file would change in length.
Otherwise clients cannot verify that they got the correct file.

Which checksum(s) a repository should support would have to be told to
"svnadmin create" - although adding other checksums later, via a
re-hashing operation, is possible, too.

Ad 4:
A simple algorithm could be to do a "svnadmin dump"/"svnadmin load", with
a small script in the middle which filters out any file whose MD5 has
already been seen and replaces it with a zero-byte file. Then compare the
repository sizes.
Or, if the repository is BDB, just retrieve all revisions, and for every
duplicate MD5 add up the length of the corresponding string-table entry
(or entries).
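
For the BDB variant, the accounting step could be sketched like this,
assuming the (MD5, size) pairs have already been pulled out of the revisions
by some external means; the function name is illustrative only:

    #include <apr_pools.h>
    #include <apr_hash.h>
    #include <apr_strings.h>   /* apr_pmemdup */

    /* Return the number of bytes that could be saved for this entry:
     * the full size if the digest was seen before, zero otherwise. */
    static apr_off_t
    account_duplicate(apr_hash_t *seen, const unsigned char *md5,
                      apr_off_t size, apr_pool_t *pool)
    {
      if (apr_hash_get(seen, md5, 16))
        return size;

      apr_hash_set(seen, apr_pmemdup(pool, md5, 16), 16, "seen");
      return 0;
    }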
