
Re: Svn 1.9 repository 20% bigger than svn 1.8 repository

From: Stefan Fuhrmann <stefan2_at_apache.org>
Date: Thu, 28 Jan 2016 19:37:38 +0100

On Thu, 28 Jan 2016 11:54:14 +0200, you wrote:

> I have a svn 1.9 repository, created with svnsync, that has ~150000
> revisions and size about 45 GB.

That is about 300 kB per revision, which is quite
large: on average, more than 1 MB of changes per
revision before compression. Are these office documents,
large xml / html files or simply many files per commit?
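
The per-revision figure follows directly from the numbers quoted above. A minimal sketch of the arithmetic, assuming decimal units (1 GB = 10^9 bytes), which the original mail does not specify:

```python
# Figures quoted in the original mail; both are approximate ("~150000", "about 45 GB").
repo_size_gb = 45.0
revisions = 150_000

# Assumption: decimal units, 1 GB = 1e9 bytes and 1 kB = 1e3 bytes.
avg_kb_per_rev = repo_size_gb * 1e9 / revisions / 1e3

print(f"average size per revision: {avg_kb_per_rev:.0f} kB")
```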

> Due to some issues in svn-all-fast-export I
> wanted to have svn 1.8 version repository so I downgraded it by doing
> svnadmin (v 1.9) dump /svnadmin (v 1.8) load cycle. I was surprised that
> the size of v 1.8 repository is "only" 37.5 GB
> I tried to compare content of db\revs folder: some files are bigger in 1.8
> repo, some in 1.9 repo.

For the record: you already said elsewhere in this
thread that you used 1.8 to create the 1.8 repo and
1.9 for the 1.9. I also assume standard settings
as in "no fsfs.conf tweaks".

> Now I'm wondering:
> 1. Is such size increase expected for 1.9 repository? I read that 1.9 was
> aimed at speed optimizations, but 20% size increase compared to 1.8 sounds
> pretty big...

A 20% plus is definitely unexpected, +/-5% being a
more typical number. It is not entirely implausible,
though. Here is how 1.9 differs from 1.8:

* 1.9 adds "index" data to the rev / pack files,
   allowing for slightly shorter data elsewhere.
   The typical net effect is +5% in size.
* 1.9 adds some padding at the end of each block
   (64k boundary by default) to avoid parsed data
   crossing block boundaries. Net effect typ. +1%.
* 1.9 will use skip-deltas between shards where
   1.8 would still use "linear" deltification.
   Net effect typ. +2%
* 1.9 will store deltas against very small files
   or directories. Net effect typ. <1%

* 1.9 now supports representation sharing for
   node properties. Net effect typ. 0..-5%.
* 1.9 now supports representation sharing when
   committing the same data to multiple paths /
   branches within the same revision.
   Net effect typ. 0..-5%.

The theme behind these changes is I/O reduction:
Maximize data sharing, enable reordering of repo
data upon pack and avoid "pointer chasing" for
small pieces of information.
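
To see why +20% falls outside the usual range, the typical net effects listed above can be added up. This is only a rough sketch: it assumes the effects are simply additive, reads "<1%" as a 0..1 range, and treats the single-figure entries as point values. None of that is stated in the list itself.

```python
# Typical net effects on repository size, in percent, as (low, high) ranges.
# Values are taken from the list above; point values use low == high.
effects = {
    "index data in rev / pack files": (5.0, 5.0),
    "padding at block boundaries": (1.0, 1.0),
    "skip-deltas between shards": (2.0, 2.0),
    "deltas against very small items": (0.0, 1.0),   # "<1%" read as 0..1
    "rep sharing for node properties": (-5.0, 0.0),
    "rep sharing within one revision": (-5.0, 0.0),
}

# Assumption: the individual effects simply add up.
low = sum(lo for lo, _ in effects.values())
high = sum(hi for _, hi in effects.values())

# Observed difference reported in this thread: 45 GB (1.9) vs. 37.5 GB (1.8).
observed = (45.0 / 37.5 - 1.0) * 100.0

print(f"typical range: {low:+.0f}% .. {high:+.0f}%")
print(f"observed:      {observed:+.0f}%")
```

Under these assumptions the typical range comes out at roughly -2% to +9%, so the reported +20% is well outside it.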

> 2. Or is my "dumped and reloaded 1.8" broken somehow? How could I verify?
> (dump revisions one by one and compare? Or is there any better way?)

There is a simple way to compare the "content size"
of your repositories. Run the 1.9 svnfsfs tool on both:

svnfsfs stats -M 1000 /path/to/repo > /some/output/path

It basically reads the whole repository, groups and
aggregates the item sizes and produces a long report.
The number of changes and node revisions should be
more or less (exactly?) the same in both reports.
If they are, you'll be good.

"Representation" size is where the numbers will differ.
Looking at the differences in detail, you should be able
to pin down one or two file extensions that account for
most of the increase. It would be interesting to learn
what is special about them ...
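
One way to compare the two reports is a plain line-by-line diff. A minimal sketch, assuming each report was saved to a text file as in the command above; the function names and default file labels are hypothetical, and nothing here depends on the exact layout of the svnfsfs output:

```python
import difflib


def diff_reports(text_a, text_b, name_a="repo-1.8.stats", name_b="repo-1.9.stats"):
    """Return the unified-diff lines between two report texts.

    The default names are placeholders used only as diff headers.
    """
    return list(
        difflib.unified_diff(
            text_a.splitlines(),
            text_b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
            n=0,  # no context lines: show only the lines that differ
        )
    )


def diff_report_files(path_a, path_b):
    """Read two saved report files and return their differing lines."""
    with open(path_a) as file_a, open(path_b) as file_b:
        return diff_reports(file_a.read(), file_b.read(), path_a, path_b)
```

Calling diff_report_files() with the two output paths and printing the result shows only the lines where the reports disagree, which is where the representation sizes should show up.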

-- Stefan^2.
Received on 2016-01-28 19:36:50 CET

This is an archived mail posted to the Subversion Users mailing list.
