
[TSVN] RFC: New cache scheme

From: Will Dean <svn_at_indcomp.co.uk>
Date: 2005-01-20 22:21:54 CET

Guys,

Now that 1.1.3 is out of the way, I'd like to say something about the new
cache I've been working on. It's very preliminary at the moment, but I
thought this would be a good time to get comments about the concept.

I think everybody's aware that the TSVN shell extension does lots of
caching of file status, to try and minimise the amount of SVN work which
goes on as one is browsing through Explorer. In normal situations, this
cache is now pretty good, and I generally find performance to be
acceptable. (I first got involved with the TSVN code because I was so
infuriated by the shell performance - that was partly an SVN problem as it
turned out.)

Anyway, for all its charm, the current shell extension does have some
problems:

1. In order to prevent the cache becoming stale (and, I think, because of
historical concern about the amount of memory it might consume), cached
items have very short lifetimes (a few seconds). In certain cases (large
directories, slow filesystems), this can cause pathological cache
thrashing, where the time taken to build the cache exceeds the lifetime of
its members. This is disastrous, as the very time you need the cache most
is when it's slow to build. There is plenty of sticking plaster stuck on
this particular wound, but it's not very pretty.

2. Unless you're in recursive status mode, the cache only holds the status
for one folder.

3. The SVN libraries are statically linked, big, and slow to start. The
shell extension has to include them in order to get item status. Every
process which opens a file-open dialog (not exactly a lightweight activity
at the best of times) has to suffer SVN starting up and loading into the
process the first time the dialog is opened.

4. Because the shell extension is an in-process COM object (shell
extensions are supposed to be in-process, this isn't a mistake), there is
one cache per process. With the current very short cache lifetimes, this
doesn't really make any difference to anybody, but it could be a
missed opportunity in terms of re-use of cached items. (For example, I
think it's reasonably probable that you'll have Explorer windows and app
file-open boxes pointing into similar folders.)

5. Shell extensions are a pig to debug.
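
To put some numbers on the thrashing in point 1, here is a small simulation. It is only a sketch on a made-up clock: the function name, the per-item cost and the lifetime are assumptions for illustration, not measurements from TSVN. It counts how many status fetches are needed to display the same folder three times.

```python
def count_fetches(num_items, seconds_per_item, lifetime_seconds, passes=3):
    """Count status fetches needed to show num_items over several passes.

    Uses a simulated clock. A cached entry is valid while
    now - fetched_at < lifetime_seconds. Cache hits are treated as free,
    so only real fetches advance the clock.
    """
    now = 0.0
    fetched_at = {}  # item index -> simulated time of its last fetch
    fetches = 0
    for _ in range(passes):
        for item in range(num_items):
            stamp = fetched_at.get(item)
            if stamp is None or now - stamp >= lifetime_seconds:
                now += seconds_per_item  # pay the cost of asking SVN
                fetched_at[item] = now
                fetches += 1
    return fetches


# Small folder: one pass takes 1s, well inside the 5s lifetime.
small = count_fetches(num_items=10, seconds_per_item=0.1, lifetime_seconds=5.0)

# Large folder: one pass takes 10s, so every entry has expired
# by the time it is needed again.
large = count_fetches(num_items=100, seconds_per_item=0.1, lifetime_seconds=5.0)
```

With these assumed figures the small folder is fetched once per item, while the large folder is fetched in full on every pass, so the cache contributes nothing exactly where it is needed most.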

I have been working on a completely different way of doing things, which
shows some promise. It goes as follows:

1. Create a new application 'TSVNCache', which can run in the background,
with a simple IPC interface which allows other processes to request the SVN
status of a path. There's no UI on this application.

2. Rip all the SVN status stuff out of the shell extension and replace it
with something which asks TSVNCache for the status of a path. The shell
extension knows nothing about SVN except for the layout of an
svn_wc_status_t structure (which is what it's given by TSVNCache). The
cache knows nothing about the shell extension or why it wants the status of
the file, it just returns the status. To take this step to the limit, the
property-page handler would probably need to come out into a separate DLL,
because it's always going to need SVN.

At this point, we've probably slowed things down slightly, because there's
now an inter-process call (on a named pipe) between the shell extension and
the cache. However, the cache is now a nice little stand-alone process,
which one can start and stop at will, play around with and debug
easily. (If you stop TSVNCache, the shell extension just marks things as
unversioned, connecting to the cache again when it restarts.)
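
The fallback behaviour described above could look roughly like this on the shell-extension side. This is a sketch only: the real thing talks over a named pipe and receives an svn_wc_status_t, whereas here the transport is a hypothetical `connect` factory and statuses are plain strings, both invented for illustration.

```python
UNVERSIONED = "unversioned"


class CacheClient:
    """Asks a background cache for path status; degrades if it is down."""

    def __init__(self, connect):
        # `connect` is a hypothetical factory standing in for opening the
        # pipe. It returns a callable (path -> status) and raises
        # ConnectionError when the cache process is not running.
        self._connect = connect
        self._request = None

    def get_status(self, path):
        try:
            if self._request is None:
                self._request = self._connect()
            return self._request(path)
        except ConnectionError:
            # Cache is stopped or the connection broke: drop the connection
            # so the next call tries to reconnect, and show the item as
            # unversioned in the meantime.
            self._request = None
            return UNVERSIONED
```

The point of the design is that the client never blocks or fails just because the cache is absent; it simply retries the connection on the next request.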

So, the next step is to improve the cache:

3. Separate the caching of files and folders, so that you can build a big
cache without needing to search a huge list of unstructured file names.

4. Increase the cache lifetime (let's say that it's infinite).

5. Keep track of the modification time of files which are cached, and the
modification time of the .svn\entries file, and use these as hints to
invalidate the cache. Note that these hints are agnostic about the client
you use, so you can use the SVN CL and the cache will still be invalidated
properly.
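
Steps 3 to 5 can be sketched as follows. Treat it as a minimal illustration under stated assumptions: `StatusCache` and `fetch_status` are hypothetical names, the status is whatever the callable returns rather than a real svn_wc_status_t, and a missing file is simply recorded as having no modification time.

```python
import os


def _mtime(path):
    """Modification time in nanoseconds, or None if the path is missing."""
    try:
        return os.stat(path).st_mtime_ns
    except OSError:
        return None


class StatusCache:
    """Per-folder cache with no expiry, invalidated by mtime hints."""

    def __init__(self, fetch_status):
        # `fetch_status` is a hypothetical callable (path -> status) that
        # stands in for the expensive SVN status call.
        self._fetch_status = fetch_status
        # folder -> {file name -> ((file mtime, entries mtime), status)}
        self._folders = {}

    def get(self, path):
        folder, name = os.path.split(os.path.abspath(path))
        entries = os.path.join(folder, ".svn", "entries")
        stamp = (_mtime(path), _mtime(entries))

        files = self._folders.setdefault(folder, {})
        cached = files.get(name)
        if cached is not None and cached[0] == stamp:
            return cached[1]  # neither hint changed: reuse the status

        status = self._fetch_status(path)
        files[name] = (stamp, status)
        return status
```

Because the hints come from the filesystem rather than from TSVN itself, a commit or update made with the command-line client changes .svn\entries and invalidates the entry just the same.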

.... This is about where I've got to at the moment ....

I don't currently implement recursive folder status, but my idea for this
is to do something along the following lines:

1. Fetch the minimum required status information synchronously, as at the
moment.
2. As a lazy, background task, recurse downwards from each folder which is
cached, calculating the dominant SVN status for each folder.
3. Issue shell-update requests for folders as their recursive status
becomes known.
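
Steps 2 and 3 could be sketched like this. The priority order used to decide which status "dominates" is my assumption for illustration only, as are the function and parameter names. The sketch runs synchronously, so the lazy background scheduling is left out, and `notify` stands in for issuing a shell-update request.

```python
# Assumed ordering, lowest to highest priority. Illustrative only.
PRIORITY = ["unversioned", "normal", "added", "deleted", "modified", "conflicted"]
RANK = {status: rank for rank, status in enumerate(PRIORITY)}


def compute_recursive(folder, own_status, children, notify):
    """Return the dominant status of `folder` and everything beneath it.

    own_status: dict mapping each path to its own (non-recursive) status.
    children:   dict mapping each folder to a list of its child paths;
                a path that appears as a key is treated as a folder.
    notify:     callable (folder, status), invoked once per folder as soon
                as that folder's recursive status is known.
    """
    best = own_status.get(folder, "unversioned")
    for child in children.get(folder, ()):
        if child in children:
            child_status = compute_recursive(child, own_status, children, notify)
        else:
            child_status = own_status.get(child, "unversioned")
        if RANK[child_status] > RANK[best]:
            best = child_status
    notify(folder, best)
    return best
```

Folders are reported bottom-up, so a deeply nested modification is notified for the innermost folder first and then for each of its parents in turn.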

Because the cache is now so durable, usable recursive status becomes a real
possibility, which I don't feel it is at the moment (it's more of a
tantalising peek at how good it could be).

So, what do people think about all this? I'm particularly interested in
people's views on the legitimacy of my cache invalidation strategy, but I'd
welcome any input.

(Just for interest, I started by trying to implement something based on
change notifications, which would have meant I could then have the cache
generate all the shell-update notifications, but I don't think this is very
scalable.)

When I get this a bit more together, I shall also be looking for some
"people with enquiring minds" to try it out.

Cheers,

Will

Received on Thu Jan 20 22:22:53 2005

This is an archived mail posted to the TortoiseSVN Dev mailing list.