[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Size of revs file when deleting lines in a big text file

From: Malcolm Rowe <malcolm-svn-dev_at_farside.org.uk>
Date: 2006-12-11 22:33:05 CET

Hi Martin,

On Sun, Dec 10, 2006 at 02:48:19AM +0000, Martin Scharrer wrote:
> A text file, mbox with mails, with about 5 MB is (already) in subversion.
> I deleted now one email, i.e. a couple of hundred lines, located in the first
> quarter of the file. The diff is about 16kBytes. After checking in this and
> other small changes a detected that the file in the 'db/revs' dir in the
> repository is over 3 MB in size.
> A 'svn diff -rN:M | wc -c' showed me only <0.5 MB.
>

The size of the textual diff against the previous revision and the size
of the delta (which isn't necessarily against the previous revision)
aren't related.

> The result shows me that the 'delete'-revs have very big sizes which are much
> bigger then the resulting diff (e.g. 500k rev with 16k diff)

> I tested different order of deletion/addition to make sure it's not because of
> skip-deltas like mentioned by Rob Hubbard on the users list.
>

The details are below, but in summary:

a. You _are_ seeing the result of our skip-delta algorithm, at least in
   your reproduction script.
b. We're doing the best we can given the way we move through the two
   different files (reference and version; or 'source and target' if
   you prefer), that is, window-by-window. We will always see large
   deltas when you delete or insert content midway through a large file.

So the behaviour you're seeing in your reproduction recipe is partly due
to the fact that we use a skip-list-like method to choose the delta
base. The reason is simple:

a. We choose a delta base based upon the 'version number' of the file
   itself, regardless of how many commits have occurred to other files.
   More precisely, we use the predecessor count, but for simple cases,
   that's the same as a per-file version number that starts at zero.

b. Your recipe commits a 'complete' file (the full file contents), then
   a 'short' one (with some content removed), then the 'complete' one
   again, and so on: complete, short, complete, short, complete, short...

   (You also do commits to other files, but we ignore those when working
   out the delta base).

c. We choose the delta base by taking the predecessor count as a binary
   string and unsetting the rightmost '1' bit. (This is the skip-list
   part). By doing so, we'll _always_ end up with an even-numbered
   predecessor as a delta base.

So because we always use an even predecessor as a base, we'll always use
the 'complete' file, and so all the revisions that want to produce that
complete file end up with very short deltas (ones that are essentially
minimum size: a stream of whole-window copy commands).

You can see this if you remove the calls to the other_revs function
(there's no change in the output) or if you initially commit your
'short' file and then continue 'complete', 'short', ... (you'll see then
that the _added_ files now have larger revisions: everything is now
delta'd against the short file).

You might be wondering why deleting some content ends up producing such
a large delta (c.240k with svndiff0 / --pre-1.4-compatible
repositories). After all, you've _deleted_ data, not added it, right?
And if you try the experiment I described above (reversing the order),
you'll see that the 'deleted' version is comparable in size to what you
get when you add a similar amount of text.

So why does deleting data from the middle of a file (you deleted 9105
bytes from approximately the middle of a 5461065-byte file) produce a
large delta? You hit the root of the problem here:

> and that this
> size depends on the position of the change. The size is bigger for deletion
> nearer to the start of the file and smaller for deletion more on the end.
> e.g. 1.7MB direct at the begin, 660kB at the 50% mark, 2k at the end.

In order that it can deal with arbitrarily large files, Subversion's
delta algorithm deals with each source file as a series of windows, each
of 100k. The delta algorithm reads 100k of source data and 100k of
target data and constructs the delta.

So when you remove 9k from the middle of a file, we end up with the
following:

* each window prior to the removal is simply a single copy command.
* each window after that point contains a copy command for the data
  that's common in both source and target, plus an insertion of about
  9k of data that's been brought forward from the next source window.

That new data (from the next window) is inserted directly into the delta
stream (though in svndiff1 repositories it is at least zlib-compressed),
and appears for each window past that point.

You have a 'short' file that's 5451960 bytes, or 54 100k windows. The
first 27 windows will be simple copies (about 500 bytes in total), and
the second 27 will be dominated by the 9105 bytes that were brought
forward from the next source window. 27 x 9105 is 245835 bytes,
slightly below what I see for each revision's size in an svndiff0
repository.

How can we improve this? The best way would be to increase the delta
window size significantly, though we have some backward-compatibility
issues (and probably memory constraints) to work around in order to do
that. Even so, unless we were able to change the delta logic so that we use
variable-sized windows and resynchronise where possible, we're always
going to take a hit on this kind of change with a window-oriented delta
algorithm.

>
> I now wrote a test script in perl so you can reproduce this easily. The script
> generates a test repository and a testfile and then makes a couple of
> check-ins and prints the resulting sizes
>

Thank you _very_ much for giving us a reproduction recipe: it made it
significantly easier to see what was going on.

Regards,
Malcolm

  • application/pgp-signature attachment: stored
Received on Mon Dec 11 22:33:27 2006

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.