RE: svn and "mostly-binary" repositories

From: Sander Striker <striker_at_apache.org>
Date: 2001-10-01 18:23:14 CEST

> > > - Efficient file comparison. When deciding whether a local
> > > file is in sync with the repository, use md5sum instead of
> > > fetching and diffing.
> >
> > What we do at the moment is store a pristine copy of each
> > working copy file in the admin directory. This allows us to
> > send diffs across the wire for commits, and to determine
> > whether a file has been modified by looking at mod times and
> > performing a local diff if they're different. Unfortunately,
> > it also comes at a 100% space penalty in the working copy; a
> > checkout of a 1GB source directory will come in at 2GB.
> >
> > In theory, we will make it an option in some post-1.0 release
> > to omit the pristine copy in the working copy, and instead use
> > hash checksums of some kind to decide whether a local file is
> > in sync with the repository. And, of course, to do a commit
> > the client would have to send the new complete text rather
> > than send a diff.
> [snip]
>
> As I understand it, rsync works basically like so:
>
> The server and the client both chop the file into blocks. They
> both then compute checksums of all the blocks. One then sends
> its list of block checksums to the other. Then the blocks that
> are different are transferred.
>
> i.e. it's more efficient than just transferring the whole file,
> and you don't need to keep a copy of the original to see what's
> changed.

That is not really an accurate description of rsync:
- The client chops the local file into equal-sized blocks
   and calculates two checksums for each block, md4 and
   adler32. It sends those to the server.
- The server moves a block-sized window over the file
   on the server side and calculates the adler32 checksum
   at each offset. When an adler32 match occurs with one of
   the client's blocks, the md4 hash is compared. On a match,
   the index of the block is noted as a token and the window
   shifts by an entire block. If there was no match, the window
   shifts one byte, noting the current byte. All this data is
   sent to the client.
- The client recreates the file by concatenating blocks
   (identified by the tokens, which can be simple indexes into
   the checksum table) and bytes.
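
A minimal sketch of the three steps above, in Python, just to make
the data flow concrete. It is not rsync's actual code or wire format.
Assumptions: the function names (signatures, delta, patch) and the
tiny BLOCK size are made up for illustration; md5 stands in for md4
because md4 is not reliably available in hashlib; and the weak
checksum is recomputed with zlib.adler32 at every offset, where a
real implementation would roll it from one offset to the next.

```python
import hashlib
import zlib

BLOCK = 4  # hypothetical, tiny block size so the example is easy to follow


def signatures(old: bytes, block: int = BLOCK) -> dict:
    """Client: weak + strong checksum for each equal-sized block of the local file."""
    table = {}
    for index, start in enumerate(range(0, len(old), block)):
        chunk = old[start:start + block]
        weak = zlib.adler32(chunk)
        strong = hashlib.md5(chunk).digest()  # stand-in for md4 (assumption)
        table.setdefault(weak, []).append((strong, index))
    return table


def delta(new: bytes, table: dict, block: int = BLOCK) -> list:
    """Server: slide a block-sized window over its file, emit tokens and bytes."""
    ops = []
    pos = 0
    while pos < len(new):
        window = new[pos:pos + block]
        match = None
        # Cheap check first; only compute the strong hash on a weak match.
        for strong, index in table.get(zlib.adler32(window), []):
            if hashlib.md5(window).digest() == strong:
                match = index
                break
        if match is not None:
            ops.append(("block", match))  # token: index into the checksum table
            pos += len(window)            # shift by an entire block
        else:
            ops.append(("byte", new[pos:pos + 1]))  # note the current byte
            pos += 1                                # shift by one byte
    return ops


def patch(old: bytes, ops: list, block: int = BLOCK) -> bytes:
    """Client: rebuild the server's file from local blocks and literal bytes."""
    parts = []
    for kind, value in ops:
        if kind == "block":
            parts.append(old[value * block:(value + 1) * block])
        else:
            parts.append(value)
    return b"".join(parts)


if __name__ == "__main__":
    old = b"the quick brown fox"
    new = b"the quick red fox jumps"
    ops = delta(new, signatures(old))
    assert patch(old, ops) == new
```

Only the checksum table goes one way and the token/byte stream goes
the other, and neither side needs a pristine copy of the other's file.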

That's it. The reason rsync is a pretty complex piece of
code is that directory trees are rsync'd over as well, and the
network protocol is squeezed to use every bit available.
That is the reason why the rsync network protocol was never
presented as an RFC.

I think the rsync protocol might have some place in svn, but
I don't think anyone is going to look into that until after 1.0.

> Btw, there's a librsync that might be useful for subversion.

Yes, but it is YAL (Yet Another Library), and the concept of
rsync is much more important than the implementation as is in
librsync IMHO.

Sander

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sat Oct 21 14:36:43 2006
