
Another working copy library

From: David Anderson <dave_at_natulte.net>
Date: 2007-01-17 08:36:58 CET

I've been kicking the thought around for a while now, so I'll get it
out here in the open.

I think we all know about the "organic" growth of libsvn_wc. As more
large projects like gcc or KDE adopt Subversion, they are starting to
also run into scalability issues with the working copy library that
cannot be resolved easily (not to mention companies that have groaned
a little about this).

The ones I can think of right now:

 - Having to crawl the entire tree on most operations to compute local
changes. While not too bad for a vast majority of users, large trees
take a long time to crawl, and don't even get me started on large
trees over NFS.

 - Storing metadata all over the place. This is by design, to allow
for working copy severability. It's a nice feature, but it's many more
inodes than is reasonable on large trees. And it's another component
of having to recrawl a huge tree for most operations.

 - Text-base storage. Being able to forego these, or even fetch them
on-demand, has been a feature request for a long, long time. It
doesn't look to me like we'll be implementing this any time soon,
because of the state of libsvn_wc.

 - Doesn't play well with other commandline tools. When I do find or
grep runs over a working copy, I always have to pipe that through
`grep -v .svn` to filter out all the dupes. The tool still has to
crawl twice the number of inodes and output twice the actual amount of
data. And yes, I'm sure there is a nifty hidden switch in both find
and grep that would let me exclude this intelligently. I'm sure I can
find another tree-crawling tool that we equally break and that doesn't
have an exclusion capability.

So, basically, I think that our working copy design has worked okay
for most people, but it's now shown its limits, and it might be time
for a change.

So, I want to break libsvn_wc.

Okay, now, calm down, and read through before killing me.

I've been thinking about an alternative to libsvn_wc. The semantics of
this alternative library would be slightly different, enough that it
would not be perfectly compatible with the existing libsvn_wc. This is
why I propose this as an alternative library that follows as much as
possible of the current svn_wc.h API, that would be selected/enabled
at runtime, using a dynlib mechanism similar to the one we have for ra
and fs backends. From here on, I'll refer to this alternative
implementation as libsvn_wc_sqlite.

libsvn_wc_sqlite stores all the metadata for a working copy in a
*single* SQLite database. This SQLite database is located in a .svn
subdirectory inside the root of the working copy. So, for example, if
you were to check out the svn trunk from svn.collab.net, you would
have trunk/.svn containing wc.db (and probably some other very
lightweight stuff, like a wc version file). There is no other .svn
directory anywhere else in the working copy. When you invoke an svn
command that needs to look at the working copy, libsvn_wc_sqlite walks
back up the tree from the cwd until it finds a .svn directory, and
uses that metadata for the entire tree rooted at that directory.
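
A minimal Python sketch of that upward walk, purely for illustration.
The function name find_wc_root is hypothetical, and the check looks
only for a .svn directory, as described above; a real implementation
would presumably also validate wc.db.

```python
from pathlib import Path
from typing import Optional


def find_wc_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above `start` that holds .svn."""
    current = start.resolve()
    # Check the starting directory first, then each ancestor in turn.
    for candidate in (current, *current.parents):
        if (candidate / ".svn").is_dir():
            return candidate
    # Reached the filesystem root without finding any metadata.
    return None
```

Run from trunk/subversion/libsvn_wc, this would return trunk, and
every operation would then use trunk/.svn for the whole tree.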

This preserves working copy portability, and breaks working copy
severability. That is, you can still move an entire checkout ('trunk'
in my previous example) somewhere else and have a working WC, but you
can't move a subtree out (say 'trunk/subversion') and have that be a
functioning working copy for that subtree. In fact, doing the latter
would have the effect of exporting that subtree: no metadata, just the
files.

I have never really used WC severability, but I understand there are
use cases, and more importantly, users of this feature. This would be
a first API exclusive to libsvn_wc_sqlite, something like
svn_wc_sever(), which takes a subtree of a WC and makes it into its own,
standalone WC by creating a .svn/wc.db there and entering the relevant
metadata from the parent database. I haven't worked out the exact
behavior yet from the user's POV, but it would therefore mean
something along the lines of `svn sever wc_subdir; mv wc_subdir
somewhere_else`, instead of the current `mv wc_subdir somewhere_else`.
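
To make the metadata side of this concrete, here is a rough Python
sketch. Everything in it is an assumption made for illustration: the
one-table schema (nodes, keyed by a path relative to the WC root), the
function name sever, and the choice to leave the parent's rows in
place. None of it is a worked-out design.

```python
import sqlite3

# Hypothetical schema: one row per versioned path, relative to the WC root.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS nodes ("
    "path TEXT PRIMARY KEY, revision INTEGER NOT NULL)"
)


def sever(parent: sqlite3.Connection, child: sqlite3.Connection,
          subdir: str) -> int:
    """Copy the rows for `subdir` and everything below it into `child`."""
    root = subdir.rstrip("/")
    prefix = root + "/"
    child.execute(SCHEMA)
    # substr() rather than LIKE, so '%' and '_' in file names are harmless.
    rows = parent.execute(
        "SELECT path, revision FROM nodes "
        "WHERE path = ? OR substr(path, 1, ?) = ?",
        (root, len(prefix), prefix),
    ).fetchall()
    # Re-root each path: the severed directory becomes "" in the new WC.
    rerooted = [
        ("" if path == root else path[len(prefix):], revision)
        for path, revision in rows
    ]
    with child:  # commit all copied rows in a single transaction
        child.executemany(
            "INSERT INTO nodes (path, revision) VALUES (?, ?)", rerooted
        )
    return len(rerooted)
```

Whether the parent database should drop the severed rows afterwards is
one of the behaviors still to be decided; the sketch leaves them alone.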

Another thing I'd very much like is to completely eliminate all
implicit tree crawls. The metadata is the working copy, unless the
user requests a forced crawl to update metadata for some reason.

This implies telling Subversion about all operations on versioned
data. We already do that for all operations, except for edits. I'd
like to change that. libsvn_wc_sqlite checks out the working copy
entirely read-only, and you have to tell svn (through something like
`svn edit file`... Yes, I have been using Perforce lately) that you
are touching it, at which point it'll record that in the metadata and
flip the file to be writable.

This behavior is off by default, however. The default is to crawl the
subtree rooted at cwd to work out what was edited, and to sanity check
metadata as you go. An option passed to svn checkout makes all WC
files read-only, and relies solely on the metadata to operate on the
wc, unless a particular operation forces a crawl.
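
As an illustration of the opt-in read-only mode, here is a small
Python sketch. The nodes table, its opened column, and the function
names are all hypothetical; the point is only that `svn edit` becomes
one metadata update plus a permission flip, and that finding local
edits becomes a query instead of a crawl.

```python
import sqlite3
import stat
from pathlib import Path
from typing import List

# Hypothetical schema: `opened` is 1 once the user has run `svn edit`.
SCHEMA = (
    "CREATE TABLE nodes ("
    "path TEXT PRIMARY KEY, opened INTEGER NOT NULL DEFAULT 0)"
)


def make_read_only(path: Path) -> None:
    """What a read-only checkout would do to each file it writes."""
    mode = path.stat().st_mode
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def edit(conn: sqlite3.Connection, wc_root: Path, relpath: str) -> None:
    """Record the edit in the metadata, then make the file writable."""
    with conn:  # commits on success, rolls back if we raise
        cur = conn.execute(
            "UPDATE nodes SET opened = 1 WHERE path = ?", (relpath,)
        )
        if cur.rowcount != 1:
            raise KeyError(f"not under version control: {relpath}")
    target = wc_root / relpath
    target.chmod(target.stat().st_mode | stat.S_IWUSR)


def locally_edited(conn: sqlite3.Connection) -> List[str]:
    """List edited paths from metadata alone, with no tree crawl."""
    rows = conn.execute(
        "SELECT path FROM nodes WHERE opened = 1 ORDER BY path"
    )
    return [row[0] for row in rows]
```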

Text-bases now. By default, they are stored in the metadata SQLite
database (or maybe in a separate text-base SQLite DB alongside the
regular metadata. Details.). I would however like to have a clear line
drawn in the internals of libsvn_wc_sqlite, where we could add other
behaviors in the future. Say, no text-bases and fail all operations
that require them for ultra lightweight working copies, or no
text-bases but retrieved via the ra api when needed (which opens the
way for webdav caching proxies to work their magic).
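
One way to picture that "clear line" is a small storage interface with
interchangeable backends. The Python sketch below is an assumption
about shape only: the class names, the text_base table, and the fetch
callable (standing in for a call through the ra layer) are all
hypothetical.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Callable


class TextBaseUnavailable(Exception):
    """Raised when an operation needs a text-base that is not available."""


class TextBaseStore(ABC):
    @abstractmethod
    def put(self, path: str, content: bytes) -> None: ...

    @abstractmethod
    def get(self, path: str) -> bytes: ...


class SqliteTextBases(TextBaseStore):
    """Default behavior: pristine contents live in the SQLite database."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS text_base ("
            "path TEXT PRIMARY KEY, content BLOB NOT NULL)"
        )

    def put(self, path: str, content: bytes) -> None:
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO text_base VALUES (?, ?)",
                (path, content),
            )

    def get(self, path: str) -> bytes:
        row = self.conn.execute(
            "SELECT content FROM text_base WHERE path = ?", (path,)
        ).fetchone()
        if row is None:
            raise TextBaseUnavailable(path)
        return bytes(row[0])


class NoTextBases(TextBaseStore):
    """Ultra-lightweight WC: store nothing, fail operations that need it."""

    def put(self, path: str, content: bytes) -> None:
        pass  # deliberately discard the pristine copy

    def get(self, path: str) -> bytes:
        raise TextBaseUnavailable(path)


class OnDemandTextBases(TextBaseStore):
    """Store nothing locally; retrieve via a caller-supplied fetch."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self.fetch = fetch  # hypothetical stand-in for an ra-layer call

    def put(self, path: str, content: bytes) -> None:
        pass

    def get(self, path: str) -> bytes:
        return self.fetch(path)
```

The rest of the library would only ever call put() and get(), so
adding a new storage behavior later would not touch any callers.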

I think that libsvn_wc_sqlite addresses the issues I pointed out at
the beginning of this mail: tree crawls are minimized, inode count
goes way down, commandline tools don't find text-base dupes all over
the place, and we have a clear internal API where we can handle the
text-base storage problem cleanly. And, hopefully, most operations are
reduced to an SQL select statement, which can be blindingly fast if
the database is indexed properly.
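
For instance, a status query over a subtree could be a single indexed
range scan. This Python sketch assumes a hypothetical nodes table with
a state column. Because path is the primary key, SQLite maintains an
index on it, and all paths under a directory sort contiguously between
"dir/" and "dir0" ('0' being the character right after '/').

```python
import sqlite3
from typing import List, Tuple

# Hypothetical schema; `state` is 'normal', 'modified', 'added', ...
SCHEMA = (
    "CREATE TABLE nodes ("
    "path TEXT PRIMARY KEY, state TEXT NOT NULL DEFAULT 'normal')"
)


def status(conn: sqlite3.Connection,
           subtree: str = "") -> List[Tuple[str, str]]:
    """Return (path, state) for every non-normal node in `subtree`."""
    if not subtree:
        query = (
            "SELECT path, state FROM nodes "
            "WHERE state != 'normal' ORDER BY path"
        )
        return conn.execute(query).fetchall()
    root = subtree.rstrip("/")
    low, high = root + "/", root + "0"  # bounds of the descendant range
    query = (
        "SELECT path, state FROM nodes "
        "WHERE state != 'normal' "
        "AND (path = ? OR (path >= ? AND path < ?)) "
        "ORDER BY path"
    )
    return conn.execute(query, (root, low, high)).fetchall()
```

No file in the working copy is touched to answer the question; how
fast this actually is on a large tree would need measuring.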

I'm not claiming that it is perfect; it is a different tradeoff on
various points. I do think, however, that it is worth it.

You're now all invited to shoot this idea down with sensible
arguments. I will now go and make peace with myself while you assemble
the firing squad :-). Oh, and I am willing to attempt to put
changesets where my mouth is; this isn't a rant calling for someone
else to do it.

- Dave

Received on Wed Jan 17 08:37:09 2007

This is an archived mail posted to the Subversion Dev mailing list.