[PROPOSAL] Dealing with large directories/difficult to organize trees.

From: Ben Reser <ben_at_reser.org>
Date: 2005-07-11 00:32:11 CEST

One real problem we have with Subversion is dealing with large trees
and trees that are difficult to organize.
For example:

If you had a tree like this:

If you need to work on the images in foo and bar and need atomic
commits across the two, you are forced to check out the videos
directories as well.

This also presents problems for people with sets of reusable
modules/components that might go into any number of products.
For example:

Product 1 might need subsystem 1, module 11 and subsystem 2, subsystem
21, module 211, i.e. you'd only want a tree like so checked out:

Again you're forced to check out the entire tree even if you don't need
all of it. In some cases where parts of the tree may be exceedingly
large this may make it impossible to even checkout the trees you need
all at once and forces people to use non-atomic commits.

Another good example of where this problem pops up is if you want to
reorganize a repository. If you want it to happen all in one commit
you might have to check out the entire repository, even if you only
need a few sections of the repository checked out to do the
reorganization. To get around this I've seen people actually write
programs that call the RA layers directly rather than using our client
to get the reorganization done. (Think renaming of some directories in
branches).

We do have a non-recursive checkout option but as documented in Issue
#695 it doesn't really work. The working copy doesn't know that it's
incomplete and things really go to hell in a handbasket if you try to
actually use the working copy. (Just try checking out say r10000 of
trunk non-recursive and then doing a svn up on it to HEAD, it'll fail).

On top of that, there is no way to ask to have directories filled
in, so even if we fixed just #695 it wouldn't completely solve the
problem (I consider #695 only to be that non-recursive checkouts aren't
sticky, though a lot more than that ended up getting discussed on the
issue). With our current implementation you either have all of the
subdirectories or you have none of them.

So I see two solutions to these problems. The first is a near term
solution that helps everyone, including the people that just want to do
a quick one-off checkout in order to get some atomic commit across two
trees. The second is a solution that we probably can't do until 2.0 but
that makes life much easier for people that want to use parts and pieces
of a repository on a regular basis.

The short term solution is to fix Issue #695 and create interfaces that
allow directories to stop being "missing" or to become "missing".

To fix #695 I propose the following:

a) The .svn/entries file would have entries for all directories no
matter if the checkout was recursive or not.

b) An additional entry field would be added called excluded. All
entries for kind=dir would have excluded=yes or excluded=no. We already
have missing and incomplete used throughout our code for different
things. So excluded would be the term we would use for directories that
are not represented in the wc.

A directory checked out with -N would have all of its dirs marked as
excluded.

The client can then simply not ask for updates on those dirs when
someone asks for an update.

The reason I didn't simply add a recurse flag and leave the subdirs
entries missing from the entries file was to allow for more than the all
or nothing situation that we have now.
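
To make the idea concrete, here is a rough Python sketch of how an
update walk might skip excluded directories. The Entry class, its
field names and both functions are hypothetical illustrations of the
proposal, not existing working copy structures or APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    """Hypothetical stand-in for one directory entry in .svn/entries."""
    name: str
    excluded: bool = False
    children: Dict[str, "Entry"] = field(default_factory=dict)


def dirs_to_update(entry: Entry, prefix: str = "") -> List[str]:
    """Return the directories the client would ask the server to update."""
    path = f"{prefix}/{entry.name}" if prefix else entry.name
    if entry.excluded:
        # Excluded dirs stay listed in the parent but are never reported.
        return []
    result = [path]
    for child in sorted(entry.children.values(), key=lambda e: e.name):
        result.extend(dirs_to_update(child, path))
    return result


def checkout_nonrecursive(name: str, subdirs: List[str]) -> Entry:
    """Model 'svn co -N': the root is present, every subdir is excluded."""
    root = Entry(name)
    for sub in subdirs:
        root.children[sub] = Entry(sub, excluded=True)
    return root
```

Because every subdirectory keeps its own entry, flipping a single
excluded flag is enough to bring one subtree back without touching
its siblings.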

Once that is fixed then we can start adding an interface to manipulate
what is excluded. I suggest adding two new client commands:

svn include foo
svn exclude foo

If you run include on a path name marked as excluded it will
fill in that directory. Exclude would obviously take an excluded=no
directory and remove it from the wc (being careful of local
modifications) and mark the directory as excluded=yes.

A newly included directory would always be handled recursively by
default but we could have a -N option on it to alter that.
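
A minimal sketch of the semantics I have in mind for the two commands,
in Python. The working copy is modelled as a plain dict of directory
path to flags; that dict, the repo_dirs list and the function names
are all assumptions made up for illustration, and files are not
modelled at all.

```python
class LocalModificationsError(Exception):
    """Raised when excluding a directory would discard local changes."""


def exclude(wc: dict, path: str) -> None:
    """Model 'svn exclude': drop path's contents and mark it excluded."""
    under = [p for p in wc if p == path or p.startswith(path + "/")]
    if any(wc[p]["modified"] for p in under):
        # Be careful of local modifications: refuse rather than lose them.
        raise LocalModificationsError(path)
    for p in under:
        if p != path:
            del wc[p]
    wc[path] = {"excluded": True, "modified": False}


def include(wc: dict, path: str, repo_dirs: list,
            recursive: bool = True) -> None:
    """Model 'svn include': fill in a directory marked as excluded."""
    if not wc.get(path, {}).get("excluded"):
        return  # not excluded, nothing to fill in
    wc[path] = {"excluded": False, "modified": False}
    for d in repo_dirs:
        if not d.startswith(path + "/"):
            continue
        direct_child = "/" not in d[len(path) + 1:]
        if recursive:
            wc[d] = {"excluded": False, "modified": False}
        elif direct_child:
            # With -N the subdirs get entries but stay excluded.
            wc[d] = {"excluded": True, "modified": False}
```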

I also think we should have --include as an option to checkout when -N
is also passed. The directory paths passed to it would be relative to
the top of the wc and would always be added recursively. This would
ease most common situations for people that needed to use these things
on a regular basis. For example in the above functional/product example
you could get the wc you wanted by doing:

svn co -N URL wc/prod_1
svn include wc/prod_1/product/prod_1
svn include wc/functional/subsystem_1/module_11
svn include wc/subsystem2/subsystem_21/module_212

or in one step:

svn co -N URL --include wc/prod_1/product/prod_1 --include wc/functional/subsystem_1/module_11 --include wc/subsystem2/subsystem_21/module_212 wc/prod_1

Note that the intermediary directories would get made but would not be
handled recursively, only the last directory would.
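
As a sketch of that rule, the following Python function works out which
directories a set of --include paths would create and how each would be
handled. The function name and the "recursive"/"non-recursive" labels
are hypothetical, and paths are assumed to be relative to the top of
the wc.

```python
def plan_includes(include_paths):
    """Map each directory to how it would be handled after checkout."""
    plan = {"": "non-recursive"}  # the wc root itself, from 'co -N'
    for path in include_paths:
        parts = path.strip("/").split("/")
        # Intermediary directories get made but are not filled in,
        # unless an earlier include already made them recursive.
        for i in range(1, len(parts)):
            plan.setdefault("/".join(parts[:i]), "non-recursive")
        # Only the last directory is handled recursively.
        plan["/".join(parts)] = "recursive"
    return plan
```

For example, plan_includes(["product/prod_1"]) makes "product"
non-recursive and "product/prod_1" recursive.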

We could also have a --exclude for when -N is not passed to checkout.
This would make it easy to exclude a large directory that you don't need
(e.g. large media files you don't need).

One alternative to adding the include command would be to add a form to
checkout that if not passed a URL it would attempt to fill in the
directory (i.e. in the above examples just replace include with
checkout). Or alternatively you could do the same thing with update.

The problem with that IMHO is it doesn't leave us with any obvious place
to put exclude. I really don't think the rm command should be
overloaded for it. Adding an svn rm --local or something like that would
be just as ugly as the svn switch --relocate thing we have now.

So I'm inclined to just add two more commands. They should be
relatively straightforward to understand. And most people
wouldn't even have to worry about using them.

Backend wise we'll need to extend the reporter to handle saying what you
don't want. Right now the client says what it has and the server
decides what to send. It shouldn't be too hard to do this in a
compatible way. We've already extended the reporter to send a
lock-token. We can just add an exclude="true" entry for DAV and an
exclude:bool for svn. We may want to also add an include="true" to
allow for adding paths to non-recursive checkouts. Without it we'd
probably just implement the --include as a checkout with a series of
includes. But it might be nicer to actually have a real single session
to the server that gets exactly what we want in one step. Disadvantage
is that this will be harder to properly handle in a backwards compatible
way (i.e. we want the server to send us more data than it usually does,
so if it failed to send it to us we'd have to ask for it separately
anyway).

If we add this in 1.3, then 1.3 servers would skip over the excluded
paths and return the rest. When running against 1.2 and older servers
the server would continue to return everything. We'd have to just make
the client filter those updates as it receives them and "do the right
thing". Against older servers network performance would be horrible,
but it would still be an improvement disk space wise. Upgrading
the server to 1.3 would provide good performance for all.
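
The client-side filtering for older servers could look roughly like
this Python sketch. Representing editor calls as (operation, path)
tuples is a simplification invented here for illustration; the real
update editor interface is richer than that.

```python
def is_excluded(path, excluded_dirs):
    """True if path is an excluded directory or lies beneath one."""
    return any(path == d or path.startswith(d + "/") for d in excluded_dirs)


def filter_editor_calls(calls, excluded_dirs):
    """Yield only the editor calls the wc should actually apply.

    An old server ignores our request and sends everything, so the
    data still crosses the network but never lands on disk.
    """
    for op, path in calls:
        if not is_excluded(path, excluded_dirs):
            yield op, path
```

Matching on "d/" rather than a bare prefix keeps a sibling such as
trunk/videos2 from being dropped when only trunk/videos is excluded.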

In the longer term we can make things nice by supporting something like
symbolic links within the server. This would allow people in situations
like this to make "views" within the server and to simply do a checkout
against one. We've talked about this but it'd take a schema change that
we don't want to make until 2.0.

Even if we plan on doing symbolic links in the future doing the above in
the near term is still a good idea. The symbolic link trick isn't
really ideal for every use case. Plus I believe that the reporter
changes should make it possible to make `svn up foo bar` actually run a
single update request against the server, which would be nice since it
gives us atomic updates.

So here's a summary of what needs to be done to allow for the
"disjointed wcs" that we'd want to support to make this possible:

* make non-recursive checkouts sticky
* add interfaces to manipulate which dirs are to be tracked by the wc.
* add parameter(s) to the reporter code to pass through what to
  exclude and/or include.
* Make the repos code know how to handle the new reporter flags when
  generating update_editor calls.
* Make the wc smart enough to filter the update_editor calls when
  the server ignores our request not to send the extra data.

I think this is fairly realistic to do for 1.3. Thoughts? I don't think
I've missed anything but if I have let me know. I'm guessing the user
interface is really where the controversy will lie.

-- 
Ben Reser <ben@reser.org>
http://ben.reser.org
"Conscience is the inner voice which warns us somebody may be looking."
- H.L. Mencken

Received on Mon Jul 11 00:33:14 2005
