
Re: Svn 1.9 repository 20% bigger than svn 1.8 repository

From: Stefan Fuhrmann <stefan2_at_apache.org>
Date: Sat, 30 Jan 2016 11:02:20 +0100

On 29.01.2016 11:17, Gert Kello wrote:
> > > I have a svn 1.9 repository, created with svnsync, that has ~150000
> > > revisions and size about 45 GB.
> >
> > 300kB/rev is quite large, like >1 MB of changes before
> > compression - on average. Are these office documents,
> > large xml / html files or simply many files per commit?
>
> The content is mixed. Quite many small, source code commits. But
> office documents and zip archives as well. There are even few
> extremely huge commits, biggest one is 3+GB, one 800+MB and one 500+MB
> (as per revision file size in db/revs folder)

Thanks for the data point, Gert. As a repo backend developer,
I'm always interested to hear about people's usage patterns.

> > There is a simple way to compare the "content size"
> > of your repositories. Run the 1.9 svnfsfs tool on both:
> >
> > svnfsfs stats -M 1000 /path/to/repo > /some/output/path
> >
> > It basically reads the whole repository, groups and
> > aggregates the item sizes and produces a long report.
> > The number of changes and node revisions should be more
> > or less (exactly?) the same. If they are, you'll
> > be good.
> >
> > "Representation" size is where the numbers will differ.
> > Looking at the differences in detail, you should be able
> > to pin down one or two file extensions that account for
> > most of the increase. It would be interesting to learn
> > what is special about them ...
>
> Yes, number of changes and number of node revision records are
> identical. The number of representations does differ (1.744.149 @1.8 vs
> 1.901.312 @1.9)
> The "nodes total", "directory noderevs" and "file noderevs" numbers
> are identical

So, all user content is there and merely the deduplication failed
(as already being investigated elsewhere in this thread).
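The report comparison suggested earlier in this thread can be scripted. Below is a minimal Python sketch, assuming both svnfsfs reports have already been saved as plain text. It makes no assumption about the report layout and simply lists the lines that differ. The function name and the file names in the usage note are made up for illustration.

```python
import difflib


def changed_lines(old_report, new_report):
    """Return only the lines that differ between two text reports.

    Lines present only in the old report are prefixed with "- ",
    lines present only in the new report with "+ ".
    """
    delta = difflib.ndiff(old_report.splitlines(), new_report.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop those.
    return [line for line in delta if line.startswith(("- ", "+ "))]
```

For example, `changed_lines(open("stats-1.8.txt").read(), open("stats-1.9.txt").read())`, where both file names are placeholders for wherever the two reports were written.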
>
> The "Largest representations:" section shows that 1.9 has failed to
> de-duplicate several files (executables in this case)
>
> The "Extensions by number of representations:" shows that all
> extensions have a bigger number of representations in the 1.9 repo
>
> The size of representations is most increased for .exe and .pdf
> extensions, where .exe causes 5GB increase and .pdf 500MB. Several
> types cause increase ~300MB, "others" have +1GB
>
> The dump/load cycle into 1.9 is finished as well, now it is 36.2 GB
> (less compared to 1.8 which was 37.5 GB). Both 1.9->1.9 and 1.8->1.9
> resulted in almost identical repos when comparing files byte by byte (the
> exception is the UUID file)... Which makes me wonder if I dumped the same
> repo twice. Too bad the Windows cmd doesn't retain command history.
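
As a quick back-of-the-envelope check of the sizes quoted in this thread, here is a small Python sketch. It assumes decimal units (1 GB = 10^9 bytes), and the helper names are made up for illustration. With these inputs the synced 1.9 repository comes out at about 300 kB per revision and about 20% larger than the 1.8 one, while the dump/loaded 1.9 repository is roughly 3.5% smaller.

```python
GB = 10**9  # assumption: decimal gigabytes


def bytes_per_revision(repo_size_gb, revisions):
    """Average on-disk size per revision, in bytes."""
    return repo_size_gb * GB / revisions


def percent_change(new, old):
    """Relative size change of `new` versus `old`, in percent."""
    return (new - old) / old * 100


# Synced 1.9 repository: 45 GB over ~150000 revisions, shown in kB per revision.
print(bytes_per_revision(45, 150_000) / 1000)
# Synced 1.9 repository (45 GB) versus the 1.8 repository (37.5 GB).
print(round(percent_change(45, 37.5), 1))
# Dump/loaded 1.9 repository (36.2 GB) versus the 1.8 repository (37.5 GB).
print(round(percent_change(36.2, 37.5), 1))
```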

It is very unlikely that you actually (successfully) dumped and
loaded data twice because the number of nodes and revisions
would then be different. However, you might have tried to do
"something" to the repo while it was being written by the sync process.
That might have blocked access to the rep-cache.db, ending
all deduplication. But that is pure speculation.

Maybe we should add some retry logic to the rep-cache.db
access code. After a failure, we might retry after every, say,
1000 write attempts.
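
To illustrate the idea only, here is a minimal Python sketch of such a retry policy. It is not Subversion code: the class name, the use of OSError as the failure signal, and the callback are all assumptions made for this example. After a failure the resource is skipped, and every `interval`-th write attempt tries it again.

```python
class RetryAfterWrites:
    """Skip a failing resource, but retry it every `interval` write attempts."""

    def __init__(self, interval=1000):
        self.interval = interval
        self.disabled = False
        self.skipped = 0  # write attempts seen since the last failure

    def attempt(self, operation):
        """Call operation() if allowed; return True only if it ran and succeeded."""
        if self.disabled:
            self.skipped += 1
            if self.skipped < self.interval:
                return False  # still backing off: do not touch the resource
        try:
            operation()
        except OSError:
            # Hypothetical failure signal, e.g. the cache database being locked.
            self.disabled = True
            self.skipped = 0
            return False
        # Success: go back to using the resource on every write.
        self.disabled = False
        self.skipped = 0
        return True
```

A caller would wrap each cache write as `gate.attempt(lambda: write_to_cache(item))`, where `write_to_cache` is a placeholder, and carry on without deduplication whenever the call returns False.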

-- Stefan^2.
Received on 2016-01-30 11:02:26 CET

This is an archived mail posted to the Subversion Users mailing list.
