Re: issue #1573: fs deltification causes delays

From: <pmarek_at_users.sourceforge.net>
Date: 2003-11-06 08:34:00 CET

I should say first that I'm not really familiar with the way svn handles
deltification at the moment - so please be patient and just tell me when I'm
saying something stupid. :-)

> > There are various proposed solutions in the issue. But for now, I'd
> > like to talk just about solutions we can implement before 1.0 (i.e.,
> > before Beta, i.e., before 0.33 :-) ). The two that seem most
> > realistic are:
> >
> > 1. Prevent deltification on files over a certain size, but create
> > some sort of out-of-band compression command -- something like
> > 'svnadmin deltify/compress/whatever' that a sysadmin or cron job
> > can run during non-peak hours to reclaim disk space.
> >
> > 2. Make svn_fs_merge() spawn a deltification thread (using APR
> > threads) and return success immediately. If the thread fails to
> > deltify, it's not the end of the world: we simply don't get the
> > disk-space savings.
>
> 3. Never do deltification of any sort in the filesystem code, and
> create an out-of-band compression command that can be run as a
> post-commit hook.

Another solution, which probably won't make it into 0.33, would be the following:

If we trust that there will be no hash collisions (in SHA or MD5 or whatever -
which may not hold true [1]), then we just save a hash for each block of data.
The block boundaries are determined by a rolling CRC (see also [2]): a boundary
is placed wherever, e.g., the last 14 bits of the CRC are zero.
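
Purely as a sketch (untested; the window size, polynomial base, and all the
function names below are made up for illustration, and a Rabin-Karp-style
rolling hash stands in for the rolling CRC), the boundary detection could
look like this in Python:

    import hashlib
    import zlib

    WINDOW = 48                    # rolling-window size (illustrative value)
    BASE = 257                     # base of the polynomial rolling hash
    MOD = 1 << 32                  # keep the value in 32 bits, like a CRC
    POW = pow(BASE, WINDOW, MOD)   # weight of the byte leaving the window
    BOUNDARY_MASK = (1 << 14) - 1  # "last 14 bits zero" -> ~16 KB blocks

    def blocks(data):
        """Yield (start, length) pairs at content-defined boundaries."""
        h = 0
        start = 0
        for i, byte in enumerate(data):
            h = (h * BASE + byte) % MOD                  # push new byte in
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * POW) % MOD   # drop oldest byte
            if i + 1 >= WINDOW and (h & BOUNDARY_MASK) == 0:
                yield start, i + 1 - start
                start = i + 1
        if start < len(data):                            # final partial block
            yield start, len(data) - start

    def block_table(data):
        """One (crc, hash, start, length) entry per block, sorted by hash
        (MD5 here only because a 128-bit hash is mentioned below)."""
        table = []
        for start, length in blocks(data):
            piece = data[start:start + length]
            table.append((zlib.crc32(piece),
                          hashlib.md5(piece).hexdigest(),
                          start, length))
        return sorted(table, key=lambda b: b[1])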

So we get a (content-based) list of (crc, hash, start, length) blocks, which
we then compare against the "new" file.
In my upcoming Perl module "Digest::Manber" I record one more value as well -
the CRC just before the boundary.
So for each block we would have, e.g., a 128-bit hash, a 32-bit CRC, and the
length to compare, which should make synchronisation faster: we don't have to
compare two full files against each other, but can work from a list (probably
sorted by hash).
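
Continuing the sketch above (again purely illustrative - the 'copy'/'insert'
instruction format is invented here and is not what svn's deltification
actually produces), the comparison against the "new" file then only touches
the sorted block table instead of rereading the full old file:

    def delta(old_table, new_data):
        """Describe new_data in terms of blocks already in old_table."""
        # Key by (crc, md5) so a lookup is one dict probe, not a file scan.
        known = {(crc, h): (start, length)
                 for crc, h, start, length in old_table}
        ops = []
        for start, length in blocks(new_data):
            piece = new_data[start:start + length]
            key = (zlib.crc32(piece), hashlib.md5(piece).hexdigest())
            if key in known:
                ops.append(('copy',) + known[key])   # block already stored
            else:
                ops.append(('insert', piece))        # genuinely new data
        return ops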

I don't know exactly what is implemented today - but maybe this would make
deltification faster (at the expense of disk space, of course).

 
[1]: "An analysis of compare-by-hash" http://www.nmt.edu/~val/review/hash.pdf
[2]: "Finding Similar Files in a Large File System"
http://citeseer.nj.nec.com/manber94finding.html
