Malcolm Rowe wrote:
> On Fri, Jan 18, 2008 at 07:54:59PM +0100, Niels Werensteijn wrote:
>> Malcolm Rowe wrote:
>>> On Thu, Jan 17, 2008 at 11:49:32PM +0100, Niels Werensteijn wrote:
>>> Interesting. This sounds similar to a patch that Dan Berlin (cc'd)
>>> produced around the time of the Subversion conference, which used
>>> Rabin fingerprinting (basically a CRC) and a match table.
>> Is this in the mailinglist archive? If so, please give me a hint where to
>> find it? :)
>>
>
> Afraid not. It's on my laptop :-/, and Dan should also have a copy.
>
> Thinking about it a little more, we weren't actually planning on
> replacing vdelta -- we were going to replace xdelta, with something
> faster and/or better performing.
Ah, OK. I did take a quick look at the paper you mentioned about the
hsadelta algorithm. It might not even be too complex to implement,
considering the small chunk sizes Subversion uses for deltas (100K).
The paper itself mentions that some of the data structures could be left
out at the expense of some CPU efficiency, which might not be a big
problem with small chunks.
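Roughly what I mean, as a made-up sketch (this is not Subversion's actual
windowing code; encode_window() and the 100 * 1024 constant are just
placeholders for the example):

    #include <stdio.h>
    #include <stddef.h>

    #define WINDOW_SIZE (100 * 1024)  /* assumed ~100K delta window */

    /* Hypothetical per-window encoder.  Trailing windows of a target that
       grew past the old source length get an empty source view. */
    static void encode_window(const char *src, size_t src_len,
                              const char *tgt, size_t tgt_len)
    {
        printf("window: %lu source bytes vs %lu target bytes\n",
               (unsigned long)src_len, (unsigned long)tgt_len);
    }

    static void delta_streams(const char *src, size_t src_len,
                              const char *tgt, size_t tgt_len)
    {
        size_t off;
        for (off = 0; off < tgt_len; off += WINDOW_SIZE) {
            size_t tlen = tgt_len - off < WINDOW_SIZE ? tgt_len - off : WINDOW_SIZE;
            size_t slen = 0;
            const char *sp = NULL;
            if (off < src_len) {
                slen = src_len - off < WINDOW_SIZE ? src_len - off : WINDOW_SIZE;
                sp = src + off;
            }
            encode_window(sp, slen, tgt + off, tlen);
        }
    }

    int main(void)
    {
        static char src[150 * 1024], tgt[250 * 1024];  /* dummy buffers */
        delta_streams(src, sizeof(src), tgt, sizeof(tgt));
        return 0;
    }

Whatever data structures the encoder builds only ever have to cover one
such window, which is why dropping the fancier ones from the paper might
be acceptable.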
> vdelta-against-empty-stream is essentially just compression, so I wonder
> if we can just adapt/use zlib in some way (sorry, been a while since I
> looked at this). Though you mentioned that your changes would help if a
> large binary file changed significantly -- why's that? (because as far
> as I'm aware we need to make a decision about what compression/delta
> scheme to use _before_ we see the whole file, which means we can't say
> things like 'use xdelta, or just self-compress if that's better'.
My main concern is CPU usage. My main Subversion repository is on a
relatively slow machine. When a new version of the binary is larger, the
chunks past the old length (starting at 100K boundaries) have no source
stream to diff against. In that case, for those new chunks, vdelta is
currently chosen; with the patch it is the xdelta variant, which is
faster than vdelta at the expense of size efficiency.
The patch I made is mainly concerned with reducing the CPU cost of
(parts of) streams that are diffed against an empty source stream, as
seems to happen in a few cases.
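In other words, the decision the patch changes amounts to something like
this (names made up; this is not how the code in libsvn_delta is actually
organised, just the idea):

    #include <stdio.h>
    #include <stddef.h>

    typedef enum { ENCODER_VDELTA, ENCODER_XDELTA } encoder_t;

    /* Which encoder handles a given window.  source_view_len is the number
       of source bytes available for that window; 'patched' selects the
       behaviour with my patch applied. */
    static encoder_t choose_encoder(size_t source_view_len, int patched)
    {
        if (source_view_len > 0)
            return ENCODER_XDELTA;   /* normal case: diff against the source */

        /* Empty source view (file grew past the old length, export, ...):
           vdelta today, the xdelta variant with the patch. */
        return patched ? ENCODER_XDELTA : ENCODER_VDELTA;
    }

    int main(void)
    {
        printf("empty source, unpatched: %s\n",
               choose_encoder(0, 0) == ENCODER_VDELTA ? "vdelta" : "xdelta");
        printf("empty source, patched:   %s\n",
               choose_encoder(0, 1) == ENCODER_VDELTA ? "vdelta" : "xdelta");
        return 0;
    }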
To give an example: I want to export a big binary (a 32MB compressed
Windows installer) from my repository, hosted on a slow VIA-powered
machine, to my workstation. If I instruct Subversion to export only that
file, it just sends the whole file verbatim, and that takes 2 seconds.
If I instruct it to export the directory the file is in (which contains
only that file), it takes a whopping 46 seconds, about 23 times as
slow :) The bulk of that time is spent first in vdelta and then in zlib
compression. Since the binary is already compressed, neither
vdelta/xdelta nor zlib will find much to compress :(.
My first idea was to just emit a single insert that covers the whole
chunk, and then let zlib try to compress it, saving over 50% of the CPU
time. But that would grow the repository, because most files are
compressible by vdelta, and the repository does not use zlib. So I opted
for a faster diff algorithm instead.
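For the record, that "one big insert" idea would have looked roughly like
this (simplified stand-in types, just for illustration; the real window
and op types in svn_delta.h are richer):

    #include <stddef.h>

    typedef enum { OP_COPY_SOURCE, OP_COPY_TARGET, OP_NEW_DATA } op_action_t;

    typedef struct {
        op_action_t action;
        size_t offset;   /* only meaningful for the copy ops */
        size_t length;
    } delta_op_t;

    typedef struct {
        size_t tview_len;      /* length of the target view (the chunk) */
        int num_ops;
        delta_op_t ops[1];
        const char *new_data;  /* literal bytes carried by the window */
    } delta_window_t;

    /* Emit a window that just inserts the whole chunk verbatim and leaves
       any compression to whatever runs later (zlib, in my first idea). */
    delta_window_t make_whole_chunk_insert(const char *chunk, size_t len)
    {
        delta_window_t w;
        w.tview_len = len;
        w.num_ops = 1;
        w.ops[0].action = OP_NEW_DATA;
        w.ops[0].offset = 0;
        w.ops[0].length = len;
        w.new_data = chunk;   /* points at the caller's chunk, no copy */
        return w;
    }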
Of course this is my particular situation. I agree that the best
solution would be to use some sort of real compression scheme, but then
this would have to be taken into account when sending chunks over a
network. svnserve uses zlib to compress data (as said before, I don't
know about DAV), and running zlib on a zlib stream burns a lot of CPU
time and produces more bytes. I am not yet familiar enough with the
Subversion source to estimate how much work it would be to implement
"zlib compression instead of diffing".
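The double-compression cost is easy to see with plain zlib, completely
outside Subversion (standalone example, built with -lz):

    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    int main(void)
    {
        static unsigned char raw[64 * 1024];
        size_t i;
        uLongf once_len, twice_len;
        unsigned char *once, *twice;

        for (i = 0; i < sizeof(raw); i++)   /* repetitive, compressible input */
            raw[i] = (unsigned char)(i % 251);

        once_len = compressBound(sizeof(raw));
        once = malloc(once_len);
        compress2(once, &once_len, raw, sizeof(raw), Z_DEFAULT_COMPRESSION);

        /* Now compress the already-deflated bytes again: extra CPU spent,
           and the result is normally a little *larger* than its input. */
        twice_len = compressBound(once_len);
        twice = malloc(twice_len);
        compress2(twice, &twice_len, once, once_len, Z_DEFAULT_COMPRESSION);

        printf("raw %lu -> once %lu -> twice %lu bytes\n",
               (unsigned long)sizeof(raw), (unsigned long)once_len,
               (unsigned long)twice_len);
        free(once);
        free(twice);
        return 0;
    }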
Regards,
Niels Werensteijn