
Re: revision files absurdly large at higher revisions

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Wed, 25 Jan 2012 09:51:47 +0100

On Wed, Jan 25, 2012 at 9:06 AM, Greywolf <greywolf_at_starwolf.com> wrote:
> On 1/24/2012 23:04, Ryan Schmidt wrote:
>>
>> On Jan 24, 2012, at 15:18, The Grey Wolf wrote:
>>
>>> Hello, I'm not quite sure how to properly phrase the subject as a query
>>> term, so if this has been answered, please forgive the redundancy and
>>> quietly point me to where this gets addressed.
>>>
>>> We are using svn at work to hold customer 'vault' data [various bits of
>>> information for each customer].  It has been a huge success -- to the
>>> point where we have over 1,000 customers using vaults.  The checkins are
>>> automated, and we have amassed over 100,000 revisions thus far.
>>>
>>> User directories are created as /Ab/username [where Ab is a 2-character
>>> hash via a known (balanced) algorithm to make location of username files
>>> more machine-efficient].  So we have about 1,200 of these guys, with some
>>> hashes obviously being re-used, no big deal.
>>>
>>> The problem is that, even on minuscule changes, we are finding the
>>> db/rev/<shard>/<revno>  files to be disproportionately large; for an
>>> addition or change of a file that is about 1k-4k, the rev files are at
>>> 100K each.  At lower revisions, we noticed that the rev files are 4k but
>>> have been increasing in size with each shard that gets added, usually to
>>> the tune of 1k/shard.  With so many revisions being checked in at a rapid
>>> rate, we found ourselves having to take production off line for a couple
>>> of minutes while we migrated the repository in question to a larger
>>> filesystem due to the threat of the filesystem filling up.
>>>
>>> The upshot of this is:  Why does a minimal delta create such a large
>>> delta file?  100k for a small change?  What's going on and how can we
>>> mitigate this?
>>
>>
>> It probably has to do with the size of the directory entries, not the
>> changes you're making to the files.
>>
>> If you add a file, that's recorded as a change to the directory. When you
>> change a file, Subversion stores only the changes you made, not the
>> complete new file, and it stores them compressed. However, when you change
>> a directory (e.g. by adding or removing a file or directory), Subversion
>> records a complete new copy of the directory, and I don't know if it's
>> compressed or not. If the directory has hundreds or thousands of items,
>> that will take some space.
>>
>> I don't remember if modifying a file counts as a change to the directory,
>> but adding or deleting a file certainly does.
>>
>> Based on this I would assume you could mitigate the problem by having
>> fewer items in each directory. Create a deeper directory structure from
>> your hash: /A/Ab/username, or even /A/Ab/Abc/username. You should try this
>> out in a testing environment. Either create some test data, or dump your
>> current repository, and then a) load it into a fresh empty repository
>> as-is, and b) transform it into a deeper directory structure using a tool
>> like svndumptool, then load that into a second fresh empty repository.
>> Then see if there is an appreciable size difference.
>
>
> Interesting, to be sure.  Here's some stats.
>
> top level = 2817 entries
> second level = 1..22 entries [depending on which one]
> Some have a third level, most don't; ranges 1..27 entries.
>
> So are you saying that if I add a file /ab/username/file, it's going to copy
> the ENTIRE top level directory in as a delta?

No, not as a delta. Every revision stores the entire listing of each
parent directory of the changed path as a full-text list.
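
As a rough back-of-the-envelope sketch (not the actual FSFS code), the
per-commit overhead can be estimated by summing the full-text size of
every directory on the path to the changed file. The function name and
the 35-bytes-per-entry figure are assumptions made for illustration; the
entry counts are the ones reported earlier in this thread.

```python
# Rough estimate of the directory full-text overhead per commit.
# ASSUMPTION: ~35 bytes per serialized directory entry. This is an
# illustrative average, not a figure taken from the FSFS format docs.
ASSUMED_BYTES_PER_ENTRY = 35


def estimate_dir_overhead(entry_counts, bytes_per_entry=ASSUMED_BYTES_PER_ENTRY):
    """Sum the full-text size of every directory on the path to a changed file."""
    return sum(count * bytes_per_entry for count in entry_counts)


# Entry counts reported in this thread: top level, second level (max),
# third level (max).
overhead = estimate_dir_overhead([2817, 22, 27])
print(overhead)
```

Under that assumed entry size the estimate comes to 100310 bytes, in the
same ballpark as the ~100K rev files reported above, and nearly all of
it comes from the 2817-entry top-level directory.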

See issue #4084 [1] for recent activity on this problem (it has always
been this way, but more people have started looking into it lately).

AFAIK, Stefan Fuhrmann has recently implemented "Directory
deltification" on trunk [2], so perhaps this will come in 1.8. But
there is still some discussion and testing about the tradeoffs (it's
basically a CPU vs. storage tradeoff: deltifying directories requires
the server to do more work).
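
To make the storage side of that tradeoff concrete, here is a toy
comparison of storing a whole directory listing versus storing only what
changed. It is only a sketch: the helper names and the user0000-style
entry names are made up, and the "delta" here is a plain list of
added/removed names, not Subversion's real delta format.

```python
def full_text_bytes(entries):
    """Bytes needed to store every entry name, one per line."""
    return sum(len(name) + 1 for name in entries)


def toy_delta_bytes(old_entries, new_entries):
    """Bytes needed to store only added/removed names, each with a +/- marker."""
    changed = set(old_entries) ^ set(new_entries)
    return sum(len(name) + 2 for name in changed)


# Hypothetical directory with 2817 entries, then one entry is added.
old = [f"user{i:04d}" for i in range(2817)]
new = old + ["user2817"]

print(full_text_bytes(new), toy_delta_bytes(old, new))
```

The full listing costs tens of kilobytes in this toy model while the
change costs a handful of bytes, but reading the directory back then
means reconstructing it from the stored changes, which is where the
extra server work comes from.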

[1] http://subversion.tigris.org/issues/show_bug.cgi?id=4084 (FSFS and
BDB store large directories inefficiently)
[2] http://svn.haxx.se/dev/archive-2011-12/0356.shtml and
http://svn.haxx.se/dev/archive-2012-01/0020.shtml (the thread is
somehow broken in two on haxx.se)

-- 
Johan
Received on 2012-01-25 09:52:50 CET

This is an archived mail posted to the Subversion Users mailing list.
