
RE: compression

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Wed, 30 Jun 2010 15:47:15 +0300 (Jerusalem Daylight Time)

(fixed quoting)

Edward Ned Harvey wrote on Tue, 29 Jun 2010 at 23:57 -0000:
> Daniel Shahaf [mailto:d.s_at_daniel.shahaf.name]
> > Edward Ned Harvey wrote on Tue, 29 Jun 2010 at 07:15 -0000:
> > > Some people are seeing 20min commit times on changes they could have
> > > copied uncompressed in 1min.
> >
> > How do you know how long the commits would have taken with compression
> > disabled?
> Without svn, I "cp" or "cat > /dev/null" the file (after reboot, with cold
> cache.) So I see how long it takes to do the actual IO. And then I
> benchmarked several compression algorithms on it (lzop, gzip, bzip2, lzma,
> 7-zip) with warm cache, so the timing is 100% cpu.
> These results are corroborated by the fact that the users are sometimes
> competing against each other (bad performance) and sometimes they do an
> update or commit while nothing else is happening. If they're lucky enough
> to do their update/commit while the system is idle, it takes ~60 sec. If
> two people are doing something at the same time ... it seems to scale
> linearly, but the more collisions you have, the more likely you are to have
> more collisions.

More precisely: the more concurrent operations going on, the more likely
it is for collisions to happen.

> I've had the greatest complaints for >15min commits.

So, a commit takes 1min when the server is idle, and 15min when the
server is busy.

> So far, I've greatly improved things by just adding more cores to the
> server. But I wouldn't feel like I was doing a good job, if I didn't
> explore the possibility of accelerating the compression too.

To be honest, I'm not sure I followed. But, never mind; for the
remainder of this reply, I'll just assume that indeed CPU is the problem
(as you argue above).

> > > As far as I can tell, there is no harm in doing this. When data
> > > is read back out ... If the size matches, then it was stored
> > > uncompressed, and hence, no uncompression needed. If the size is
> > > less than the original size, then it must have been stored
> > > compressed, and hence uncompression is needed.
> >
> > A compressed file may or may not be shorter than the original file.
> >
> > You may not know the size/length in advance.
> The way things are right now, svndiff, zlib_encode() takes a chunk of data,
(in svndiff.c)
> performs compression on it, and writes (a) the size of the data, and (b)
> whichever is smaller: the data, or the compressed data.

Note that the correctness of zlib_decode() depends on this check being
done by the encoder.

> Later, svndiff, zlib_decode(), reads the size which zlib_encode() wrote,
> reads the data which zlib_encode() wrote, and if the size doesn't match,
> zlib_decode() will decompress the data, to get a chunk of data whose size
> does match.

Due to the check during encoding, the "does size match" check is
effectively a "was this compressed" check.
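
To make the scheme above concrete, here is a minimal Python sketch of the
idea, not the actual code in svndiff.c. The names encode_chunk() and
decode_chunk() are hypothetical, and the fixed 4-byte length prefix is an
assumption made for brevity; it is not claimed to match the on-disk
svndiff format.

```python
import struct
import zlib


def encode_chunk(data: bytes) -> bytes:
    """Write the original length, then the smaller of raw or deflated data."""
    compressed = zlib.compress(data)
    # Keep the raw bytes unless compression strictly shrinks them. This
    # strictness is what lets the decoder read "length matches" as
    # "stored raw".
    payload = compressed if len(compressed) < len(data) else data
    # Assumption: 4-byte big-endian length prefix, chosen only for this sketch.
    return struct.pack(">I", len(data)) + payload


def decode_chunk(blob: bytes) -> bytes:
    """Recover the original bytes written by encode_chunk()."""
    (orig_len,) = struct.unpack(">I", blob[:4])
    payload = blob[4:]
    if len(payload) == orig_len:
        # Size matches, so the encoder stored the data uncompressed.
        return payload
    out = zlib.decompress(payload)
    if len(out) != orig_len:
        raise ValueError("size mismatch after decompression")
    return out
```

If the encoder ever stored a compressed payload that happened to be exactly
as long as the original, decode_chunk() would hand back the compressed bytes
unchanged, which is why the decoder relies on the encoder's check.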

> > I don't like the idea of getting a stream and not *knowing* whether or
> > not it's compressed.
> This is the way things are right now. zlib_decode() doesn't know if it's
> compressed or not, until it checks the size.

Effectively it queries the header of the data for a "was this compressed?"
bit. Which is as okay as it could be...

To try and guess that bit from the data itself is, IMO, wrong. To get
that bit from the header is okay :-)
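
For contrast, a header can also carry that bit explicitly instead of
implying it through the size. The following is only an illustrative sketch
under that assumption: the one-byte flag, its values, and the names pack()
and unpack() are all hypothetical and are not part of svndiff.

```python
import zlib

# Hypothetical flag values for this sketch only.
RAW, DEFLATED = 0, 1


def pack(data: bytes) -> bytes:
    """Prefix one flag byte saying whether the payload was deflated."""
    compressed = zlib.compress(data)
    if len(compressed) < len(data):
        return bytes([DEFLATED]) + compressed
    return bytes([RAW]) + data


def unpack(blob: bytes) -> bytes:
    """Trust the flag byte; never inspect the payload to guess."""
    flag, payload = blob[0], blob[1:]
    if flag == DEFLATED:
        return zlib.decompress(payload)
    if flag == RAW:
        return payload
    raise ValueError(f"unknown flag {flag}")
```

Either way the reader learns "was this compressed?" from metadata written
by the encoder, not from the shape of the data itself.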

(I may have misunderstood your original mail; apologies if so)
Received on 2010-06-30 14:47:09 CEST

This is an archived mail posted to the Subversion Dev mailing list.