
Re: ideas to make svn update faster.

From: Thomas Zander <zander_at_kde.org>
Date: 2005-05-07 22:19:52 CEST

On Saturday 07 May 2005 20:47, Branko Čibej wrote:
> Thomas Zander wrote:
> >One of the things I notice is that svn update is not faster than cvs
> > update, which is contrary to expectations since there should be a
> > global tree revision, so it should be faster than the cvs which has a
> > revision per file.
>
> I can't imagine how you reached that conclusion. How the revisions are
> numbered has no bearing on update speed.

The initial idea is not so strange; actually, everyone I talked to who knows
about svn's global numbering came to the same (apparently incorrect)
conclusion.
If you already know the global version number, you don't need to read the
version for each file and dir. Systems like darcs are changeset based: only
one lock file is needed and, no matter how big the repo is, the time to
update is always the same. I expected svn to have that same advantage.

> >I was told on #svn that this is due to the mixed revisions stuff.
>
> Yup, and the fact that SVN has to do a lot more work than CVS, because
> of directory versioning.

The slowness is not the processor or the network; it's the amazing amount of
disk access, which for just about all KDE modules means I'm going right
through the disk caches on each update, commit or whatever.
Fewer writes seems to be a good start. Let's get to fewer reads later,
shall we?

> > [2] Which
> >I fully understand. Looking at strace output I notice that svn could be
> > a lot faster (do less writes) if svn was to be more optimistic about
> > version numbers.
>
> With "more optimistic" == "wrong", unfortunately...

No, it's not; don't dig your heels in just yet, please. But please tell me
exactly which use case I missed where things go wrong. Thanks.

> >kdelibs has ~8800 files and 378 dirs. At any time maybe 10 files have a
> >different version than the rest (hell; let it be 10%). That means that
> >around 370 .svn/entries files have been written with the only change
> > being a new version number in the name="" entry that is equal to just
> > about all the other dirs in the project.
> >A simple optimization would be to remove the directory-version number
> > (the one in the xml entry-tag with 'name=""') when it has the same one
> > as the parent dir.
>
> Have you actually measured what percentage of update time it takes to
> write those 378 entries files, or are you simply guessing that this is
> the bottleneck?

What? Don't you think the number of writes is a problem, then? The work done
on each update _is_ huge for a project like KDE (where kdelibs is just a
subdir; a normal update will easily go to 200000 files).
If you did the profiling, that's fine; let's work on that. If you didn't,
then what about working on this part now, eh?
Statting fewer files etc. comes later.

> >It's probably not going to be as simple as that (since you update subdirs
> >separately), but I'm pretty sure that a lot less xml's have to be
> > written if you follow the route that the normal state is a dir having
> > the same version as its parent. Only when that fails do you need to do
> > extra work. Being optimistic about version changes; I'd call that.
>
> Well, the first question that pops to mind is, how do you tell that the
> equal-version assumption is wrong, unless you record the dir's version
> number?

Sure you record it, but only for the dirs/files that actually have a
different version number (and svn already does that partly).
Don't think so black and white here.
As I said: you read the entries files as normal, but you don't have to
overwrite them for each dir if only the global version changed, since the
resulting XML would be exactly the same.
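
A minimal sketch of the "don't rewrite what didn't change" part, in Python
for brevity (svn itself is C). The helper name write_if_changed and the idea
of comparing the serialized text are assumptions for illustration, not how
libsvn_wc works:

```python
def write_if_changed(path, new_text):
    """Rewrite `path` only when its content would actually differ.

    Returns True if the file was written, False if the write was skipped.
    Hypothetical helper; not part of any svn API.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            if f.read() == new_text:
                return False  # identical content: no write needed
    except FileNotFoundError:
        pass  # no file yet, so fall through and create it
    with open(path, "w", encoding="utf-8") as f:
        f.write(new_text)
    return True
```

This still costs one read per entries file; it only saves the write, which
is the part I'm arguing about here.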

> >Now; there is probably going to be a lot of opinions on the above
> > subject; and I'd like to point out that svn really needs speed
> > optimizations; I have seen a LOT of complaints about this issue in
> > the KDE switchover. Remember that if you find the above suggestion
> > technically less-than-ideal.
>
> Certainly SVN needs speed optimizations. But I think you're approaching
> them exactly the wrong way around. The thing to do is to measure where
> the bottlenecks are, and strace is far from enough for that.

Hmm. I'm afraid it's not really a secret recipe: if your process is not
taking a lot of CPU and memory, but is reading and writing a lot of files,
then the first thing to look into is getting it to write fewer files, since
writing files is _always_ the slow part of disk access.
But if you did the profiling part, I'd be happy to compare notes! :)

> >The strace also showed me things like;
> >* the .svn/format file is opened 5 times for each directory.
>
> We know about that, and we already have a (tentative) plan to remove the
> format file and put the format information into the entries file.

Sounds great; good to hear I'm not smoking crack then :)

> >I would think
> >that with auto-upgrades only one (the root dir) should be enough.
>
> That, of course, is again an oversimplification. You can't make
> assumptions about the state of subdirectories in the working copy.

You can only make assumptions if you wrote the things; you make assumptions
about the format of the entries file (and other things) for the plain and
simple reason that svn wrote the file.
So if the routine that upgrades the format of the .svn dir makes sure it
actually _knows_ about the format file afterwards, then yes, you can make
assumptions.
There are lots of ways to do this: if you find an old version in a parent
dir, you upgrade it and all its child dirs (which are listed in each
entries file) at the same time, and only when everything went fine do you
note that in the format file. With this approach only one .svn/format needs
to be read.
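
One way that upgrade order could look, as a rough Python sketch. Everything
here is an assumption for illustration: upgrade_tree and the upgrade_one
callback are made-up names, and child dirs are found by scanning for
subdirs that contain a .svn dir instead of parsing the entries file:

```python
import os


def upgrade_tree(root, new_format, upgrade_one):
    """Upgrade `root` and every versioned subdir, then record the format.

    `upgrade_one(directory)` is a caller-supplied (hypothetical) function
    that upgrades a single .svn area and raises on failure. The root's
    .svn/format is written last, so it only ever vouches for a subtree
    that was upgraded completely.
    """
    def versioned_children(directory):
        for name in sorted(os.listdir(directory)):
            child = os.path.join(directory, name)
            if name != ".svn" and os.path.isdir(os.path.join(child, ".svn")):
                yield child

    def visit(directory):
        upgrade_one(directory)
        for child in versioned_children(directory):
            visit(child)

    visit(root)  # any exception here leaves the old format file untouched
    with open(os.path.join(root, ".svn", "format"), "w") as f:
        f.write(f"{new_format}\n")
```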

> >* .svn/lock files being created in every subdir is not needed if you
> > check parent dirs that also have a .svn (and maybe the same root).
>
> What you think of as the "root" of the working copy is a figment of the
> imagination. It's quite valid to have two SVN processes fiddle in
> parallel with two subtrees in the WC. A third SVN working from a common
> root of those two subtrees could zap the WC if it didn't try to lock it
> recursively first.

If I type update in foo/bar then the root is bar. If I type update in
foo/bar/baz, then the root is baz. Simply because that's already what you
do now.
The only difference is that you create a whole lot fewer lock files.
Your example;
consider
a/b/c
a/b/d
One svn is updating c, another is updating d. Effect; one lock file in
c/.svn and another in d/.svn
Then the user types svn update in 'a'.
Effect now; svn: Working copy 'a/b/c' locked
Effect with my proposed change; well, none actually, it again gives the same
problem.

The fact that svn could just skip that dir in the update and only print a
warning is another point. But I won't go there just now.

> >So you create one in the dir you typed 'svn up' in and if someone types
> > svn up in a subdir it will change dir to parent and check for a lock
> > file until it either finds it (in this case it will, and abort) or it
> > will leave the checkout.
> >This will save a _lot_ of file-creation and removal afterwards.
>
> So, you're saying that we should check locks upwards in the working
> copy, not downwards. Interesting idea. I'd not want to guess what
> happens if you have symlinked working copies.

This is the opposite of the situation we described above. Same dirs
a/b/c, only this time the first svn is in the dir 'a', and while that's
running I start one in the subdir 'c'.
You expect it to bail out, as it does right now.
So, read my explanation and see how that will do exactly that.
Symlinks are a non-issue since svn doesn't follow them in an update anyway.
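
The upward check could be sketched like this (Python for brevity; the
function name find_lock_upwards is made up, and "leaving the checkout" is
approximated as reaching a directory without a .svn dir):

```python
import os


def find_lock_upwards(start_dir):
    """Return the nearest dir at or above `start_dir` that holds a
    .svn/lock file, or None if the checkout is left without finding one.

    Hypothetical helper; it only covers the upward direction.
    """
    current = os.path.abspath(start_dir)
    # Keep climbing while we are still inside a versioned directory.
    while os.path.isdir(os.path.join(current, ".svn")):
        if os.path.exists(os.path.join(current, ".svn", "lock")):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return None
```

With a lock in a/.svn, a second svn started in a/b/c would get 'a' back
from this and abort, without any lock file having been created in b or c.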

-- 
Thomas Zander

Received on Sat May 7 22:21:09 2005

This is an archived mail posted to the Subversion Dev mailing list.