
Re: issue #1573: fs deltification causes delays

From: Jack Repenning <jrepenning_at_collab.net>
Date: 2003-11-04 18:51:09 CET

At 10:43 PM -0600 11/3/03, kfogel@collab.net wrote:
>I'd like to discuss possible solutions to issue #1573. From the
>issue's description:
>
> If you add 5 bytes to a 256 meg file and commit, it takes many
> minutes for the svn_fs_merge() to return success, because it's
> deltifying the previous version of the file against the new
> version.

The discussion on this point seems to be focused on the individual
user experience. That's an important point, I don't mean to derail
that--and certainly "the operation fails on timeout" is a very
important individual-experience issue! But I'm also concerned about
the performance impact on other users of the system: if this
operation is so lengthy and resource-intensive, isn't it also
clobbering the system for all other uses? A large site needs to
support multiple SVN users doing various things at any one time, and
probably other stuff as well. What are the implications of these
ideas on total-system impact?

I riff on that a bit:

> 1. Prevent deltification on files over a certain size, but create
> some sort of out-of-band compression command -- something like
> 'svnadmin deltify/compress/whatever' that a sysadmin or cron job
> can run during non-peak hours to reclaim disk space.

The idea of rescheduling to off-peak hours is a good-citizen kind of
thing. But in these days of global development, there often aren't
any "off-peak" hours. What "not quite so peakish" hours can be found
are generally over-subscribed with other admin activities already.
And the relatively few users doing their work during whatever hours
you choose to call "off-peak" are typically not happy with what they
perceive as ghettoization. So sysadmins of large sites are likely to
be very cool to this idea.

The idea of batching the process is a good-citizen kind of thing,
primarily because batched processes typically execute at reduced
priority. But this sort of arrangement is subject to catastrophic
failure: if the backlog grows enough, then a new batch might be
launched while an earlier one is still processing. This is tricky to
design for: you can't let them both begin processing the same files,
that's wasted energy; you probably don't even want them both running
at the same time, that's twice as much of this supposedly-unobtrusive
processing competing with the foreground work. Yet you can't simply
have the second batch quietly defer to the running one and die,
because this collision might actually arise from a bug in the code
that causes the first batch to hang, or to abort leaving deceptive
droppings, or something along those lines.

Work in this direction would need to deal with these matters. Have
you had any thoughts along these lines?
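
To make the overlap problem concrete, here is a minimal sketch of one
possible guard, not anything Subversion actually does. It assumes a
lock file that the running batch touches periodically; a newcomer
that finds the lock fresh backs off, but one that finds it stale
reports the fact instead of silently deferring. The names
(try_acquire, heartbeat, STALE_AFTER_SECONDS) and the six-hour
threshold are all hypothetical.

```python
import os
import time

# Hypothetical threshold: a lock not refreshed for this long is suspect.
STALE_AFTER_SECONDS = 6 * 60 * 60


def try_acquire(lock_path, now=None, stale_after=STALE_AFTER_SECONDS):
    """Return 'acquired', 'busy', or 'stale' for the given lock file."""
    now = time.time() if now is None else now
    try:
        # O_EXCL makes creation atomic: only one batch can win.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        try:
            age = now - os.path.getmtime(lock_path)
        except FileNotFoundError:
            # The owner released the lock between our two checks.
            return "busy"
        # A stale lock suggests a hung or crashed batch: report it,
        # do not quietly defer.
        return "stale" if age > stale_after else "busy"
    with os.fdopen(fd, "w") as f:
        f.write(f"{os.getpid()}\n")
    return "acquired"


def heartbeat(lock_path):
    """Called periodically by the running batch to prove it is alive."""
    os.utime(lock_path, None)


def release(lock_path):
    """Called by the running batch when it finishes normally."""
    os.remove(lock_path)
```

A batch that gets "busy" can exit quietly; one that gets "stale"
should alert an admin rather than guess. This still leaves open what
to do about files the dead batch left half-processed.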

>
> 2. Make svn_fs_merge() spawn a deltification thread (using APR
> threads) and return success immediately. If the thread fails to
> deltify, it's not the end of the world: we simply don't get the
> disk-space savings.

This approach is, so far as I can see, completely focused on the
individual-user problem, and wholly unhelpful for the whole-site
problem. While, as I say, I agree that the individual-user problem
needs to be addressed, so does the whole-site problem.

>I assume that (1) would involve a repository config option for the
>file size.

There might be other tunable parameters as well. For example, based
solely on experiments like trying to zip an already-zipped file, I
suspect that deltification of certain file types (zip files, for
example) is both unusually expensive and unusually unproductive.
Encrypted files are even scarier, since it's an explicit goal of
crypto that encrypting the exact same file twice must produce
completely dissimilar ciphertext. I posit a class of files for which
deltification is suboptimal (perhaps actually deleterious). Who are
the deltification gurus on the list? Has this question been
considered?
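
As a rough illustration of what such tunables might look like, the
sketch below combines a size cutoff with a cheap compressibility
probe on a sample of the file. It uses zlib compression as a stand-in
for "is there any redundancy to exploit", the same proxy as the zip
experiment above; it is not Subversion's delta algorithm, and
should_deltify, max_size and max_ratio are hypothetical names and
values.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (1.0 for empty input)."""
    if not data:
        return 1.0
    return len(zlib.compress(data)) / len(data)


def should_deltify(data: bytes,
                   max_size: int = 64 * 1024 * 1024,
                   sample_size: int = 64 * 1024,
                   max_ratio: float = 0.9) -> bool:
    """Guess whether deltifying this content is likely to pay off.

    Skips files over max_size, and files whose leading sample barely
    compresses (already-compressed or encrypted content looks random).
    """
    if len(data) > max_size:
        return False
    return compression_ratio(data[:sample_size]) <= max_ratio


if __name__ == "__main__":
    text = b"svn_fs_merge deltification\n" * 4000
    noise = os.urandom(64 * 1024)  # stands in for encrypted content
    print("text :", compression_ratio(text), should_deltify(text))
    print("noise:", compression_ratio(noise), should_deltify(noise))
```

A real policy would probably key on measured delta size against the
previous revision rather than on standalone compressibility, but the
probe shows how cheaply the hopeless cases could be screened out.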

-- 
-==-
Jack Repenning
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
o: 650.228.2562
c: 408.835.8090
f: 650.228.2501
Received on Tue Nov 4 18:51:54 2003

This is an archived mail posted to the Subversion Dev mailing list.