
Minimizing repository growth when large files change....

From: Peter Valdemar Mørch <swp5jhu02_at_sneakemail.com>
Date: 2004-12-28 21:28:14 CET

Hi there,

I'm trying to figure out how to store large ASCII (7-bit?) files in
subversion in the most space-efficient manner.

My test data is the Contents-i386.gz from a Debian distribution
http://ftp2.de.debian.org/debian/dists/sid/Contents-i386.gz
(that contains a list of what files are in what packages)
I want to track this file over time with subversion...

It started out as a question about whether to store the Contents-i386.gz
in svn or unpack it and store Contents-i386 instead. I thought that
storing Contents-i386 would be best: the initial file would be big, but
the diffs between versions would be small.

But the repository *explodes* in size when I try this...

I have the file in two versions. Both are about 8.8 MB .gz and 122 MB
raw. They contain about 1.6 million lines, and about 1.5% of these lines
(25500 = 17000 + 8500, counting both directions of the diff) change
between the two versions. A `diff` between the raw files is 2.1 MB.

But I found that the repository grew a whopping *285.54* MB!!! That's
about 12k for each line of diff and 135 times the size of storing the
diff output!

Or 1.5% lines changed =>
    repository grew 227% (or 308% if I store the .gzs)
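
For anyone who wants to check the arithmetic, here is a minimal sketch
that recomputes the raw-file figures. It assumes "MB" means 2**20 bytes
and takes the repository sizes (125.81 MB and 411.35 MB) from the table
further down; the variable names are only illustrative.

```python
# Figures quoted in this message; MB is assumed to mean 2**20 bytes.
MB = 2**20
first_commit_mb = 125.81   # repository size after committing the first raw file
second_commit_mb = 411.35  # repository size after committing the second raw file
diff_mb = 2.1              # size of the `diff` output between the raw files
changed_lines = 25500      # 17000 + 8500 lines in the diff

growth_mb = second_commit_mb - first_commit_mb
bytes_per_changed_line = growth_mb * MB / changed_lines
growth_vs_diff = growth_mb / diff_mb
growth_pct = 100 * growth_mb / first_commit_mb

print(f"repository growth:        {growth_mb:.2f} MB")
print(f"bytes per changed line:   {bytes_per_changed_line:.0f}")
print(f"growth / diff size:       {growth_vs_diff:.1f}x")
print(f"growth vs. first commit:  {growth_pct:.0f}%")
```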

What's up?

Are there any good pointers on how to store these large text files and
track their changes? Should I store .gz or raw files?

If 25000 lines change every day, and that results in 300 MB of repository
growth, that quickly becomes unmanageable... (There are several such
files...)

Peter

----------

I tried creating a fresh repository containing first one version, then
the other for both the .gz and the raw files. Here are the repository
sizes in MB when this is done:

                        .gzs    raw files
First commit:          11.25       125.81
Second commit:         46.21       411.35
Rep Growth:            34.96       285.54
Rep Growth Ratio:       308%         227%

Repository size vs.
sum of file sizes
after 2nd commit:       263%         286%
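
The message does not say how the repository sizes were measured. One way
to take such measurements is sketched below, assuming the repository is a
plain directory on disk. The helper names `dir_size_bytes` and
`growth_report` are hypothetical, and the sketch sums apparent file sizes,
which can differ slightly from the disk-block usage that `du` reports.

```python
import os

def dir_size_bytes(root):
    """Sum the apparent sizes of all regular files below root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # skip symlinks to avoid double counting
                total += os.path.getsize(path)
    return total

def growth_report(size_before, size_after):
    """Return (absolute growth, growth as a percentage of size_before)."""
    growth = size_after - size_before
    return growth, 100.0 * growth / size_before
```

Calling `dir_size_bytes` on the repository directory after each commit and
passing the two results to `growth_report` gives the "Rep Growth" and
"Rep Growth Ratio" rows for one column.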

The two test files can be found here:
http://demo.capmon.dk/~pvm/svnLargeTextFiles/

-- 
Peter Valdemar Mørch
http://www.morch.com
Received on Tue Dec 28 21:30:57 2004

This is an archived mail posted to the Subversion Users mailing list.
