It looks like our text storage policy is a bit naïve, then. In this
case, trunk has only 8% of the fulltexts (and only 4% of the total size)
in the whole repository. That's really horrible. Of course, GCC is not
your run-of-the-mill open source project, but getting GCC to adopt
Subversion would be nice, and it won't happen if we have this kind of
overhead. There are other interesting projects out there that probably
have similar complexity.

I wonder if we could introduce some sort of total version ordering
within a node, so that we could have _one_ fulltext per node (we can, of
course, but it's not obvious that this is easy to do in 1.x). These are
all BDB-specific musings; I doubt FSFS would scale well to repositories
of this size, apart from its more size-efficient text storage.

Thanks for doing this analysis. It's exactly the sort of data point we
need. Out of interest, do you have any idea how many of those fulltexts
are directory representations? I suspect it could be a significant amount.

Tobias Ringström wrote:
> I've converted the gcc/gcc directory of the gcc CVS repository using
> cvs2svn.py. That part of the repository is 1.2 GiB, has 19934 active
> and deleted files, 404014 CVS revisions, 911 tags, 82 branches. 1308
> files are bigger than 100 kiB, and 134 files are bigger than 1 MiB.
> The dumpfile is 37.4 GiB, and the resulting Subversion repository is
> 5.6 GiB and has 54330 revisions. A lot of that size comes from
> inefficient copies made by cvs2svn.py, but the size of the fulltexts
> does not, and that size is substantial.
>
> Using code from Max Bowsher, I've written a tool to analyze the size
> of the fulltexts in the repository and where they are used. The tool
> only counts unique reps, so there is no double counting. (In other
> words, if a file is copied, all copies will refer to the same rep, but
> it will only be counted once by the tool.) It is only when a change is
> committed to a file that a new unique fulltext is created. cvs2svn.py
> does not generate unnecessary commits on branches, so those fulltexts
> would be there even if the gcc team had used Subversion from
> the start. They have nothing to do with cvs2svn. I've attached the
> tool so you can play with it and verify its correctness.
>
> At the end of this email is a list of the size of the fulltexts for
> all tags and branches. Tags and branches without fulltexts are
> omitted. The amount of fulltexts used by tags is very small as
> expected since they are simple copies. The reason three of them show
> up in the list below at all is that they share their reps with
> branches, and they happen to be counted on the tag by the tool, and
> the reps on the branches are considered duplicates. It would be more
> fair to consider the tag reps as duplicates, but it's not a big deal.
>
> Many branches have had a long life, and changes have been merged
> repeatedly from trunk. The effect of such merges is that a lot of
> files on the branches are changed, i.e. new fulltexts are created. I
> think that is a common pattern, and it will make the repository grow
> quite a bit.
>
> I hope this info will be useful to someone. I've started to dump and
> load the repository into fsfs, but it's going to take a while. The
> dump alone took over seven hours (on a very fast machine).
>
> /Tobias
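
For readers without the attached tool, the counting rule described above
(each unique rep is counted once, on whichever tag, branch or trunk it is
first seen under) can be sketched roughly as follows. This is not Tobias's
tool: the input format, a list of (path, rep_key, size) tuples, and the
names root_of, fulltext_usage and share are all assumptions made purely
for illustration.

```python
from collections import defaultdict


def root_of(path):
    """Map a path to 'trunk', 'branches/<name>' or 'tags/<name>'.

    Assumes the conventional trunk/branches/tags layout.
    """
    parts = path.strip("/").split("/")
    if parts[0] in ("branches", "tags") and len(parts) > 1:
        return "/".join(parts[:2])
    return parts[0]


def fulltext_usage(nodes):
    """Count unique fulltexts per root.

    nodes: iterable of (path, rep_key, size_in_bytes) tuples (hypothetical
    input format). A rep shared by several copies is attributed only to
    the first root it is seen under; later sightings are duplicates.
    """
    seen = set()
    usage = defaultdict(lambda: [0, 0])  # root -> [count, bytes]
    for path, rep_key, size in nodes:
        if rep_key in seen:
            continue  # same rep reached through a copy: skip it
        seen.add(rep_key)
        entry = usage[root_of(path)]
        entry[0] += 1
        entry[1] += size
    return {root: tuple(v) for root, v in usage.items()}


def share(usage, root):
    """Return (fraction of fulltexts, fraction of bytes) held by root.

    Assumes usage is non-empty and the total size is non-zero.
    """
    total_count = sum(c for c, _ in usage.values())
    total_size = sum(s for _, s in usage.values())
    count, size = usage.get(root, (0, 0))
    return count / total_count, size / total_size


if __name__ == "__main__":
    # Made-up sample data, not taken from the gcc repository.
    nodes = [
        ("trunk/a.c", "r1", 100),
        ("tags/t1/a.c", "r1", 100),      # plain copy of trunk: duplicate
        ("tags/t2/b.c", "r2", 300),      # seen on the tag first
        ("branches/b1/b.c", "r2", 300),  # so the branch is the duplicate
        ("branches/b1/a.c", "r3", 600),  # changed on the branch: new rep
    ]
    usage = fulltext_usage(nodes)
    print(usage)
    print(share(usage, "trunk"))
```

Attributing a shared rep to whichever root is visited first is what makes
a few tags show up with fulltexts of their own; visiting branches before
tags would move those reps to the branches instead.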
-- Brane
Received on Wed Jun 30 09:36:17 2004