Re: svn commit: r11989 - trunk/notes

From: <kfogel_at_collab.net>
Date: 2004-11-22 20:07:03 CET

Philip, THANK YOU. We needed this document so badly!

philip@tigris.org writes:
> Author: philip
> Date: Mon Nov 22 12:37:37 2004
> New Revision: 11989
>
> Added:
> trunk/notes/wc-improvements
> Log:
> * notes/wc-improvements: New file.
>
>
> Added: trunk/notes/wc-improvements
> Url: http://svn.collab.net/viewcvs/svn/trunk/notes/wc-improvements?view=auto&rev=11989
> ==============================================================================
> --- (empty file)
> +++ trunk/notes/wc-improvements Mon Nov 22 12:37:37 2004
> @@ -0,0 +1,171 @@
> +Anyone who has worked on the libsvn_wc code will have discovered that
> +the current code is a complicated mess of special cases, and that it
> +is difficult to understand, inconsistent, slow and buggy. I know this
> +because I wrote some of it. It's possible that the libsvn_wc code
> +will gradually evolve into an elegant, efficient, code base; on the
> +other hand comments like "when we rewrite libsvn_wc" regularly appear
> +on the dev list. This document is *not* a plan or design for a
> +rewrite, it's just some of the thoughts of a libsvn_wc hacker.
> +
> +From Past to Present
> +====================
> +
> +The original code for libsvn_wc used an implementation that stored
> +more or less all state information on disk in the .svn area on disk,
> +so during most operations the entries files were read and written many
> +times. This led to the development of an API that passed around a lot
> +of path parameters (as svn_string_t* originally, as const char* now)
> +and to the development of the svn_io_xxx functions, which also accept
> +path parameters. The implementation was slow, and didn't scale
> +particularly well as working copies got larger. To improve things the
> +current access baton and entries caching system was gradually hacked
> +in, and libsvn_wc is now faster and scales a bit better, but problems
> +still remain.
> +
> +My favourite example of the problems caused by the "path as parameter"
> +API is svn_wc_props_modified_p. It's basic function is to determine
> +if the base props file and the working props file are the same or
> +different, but physical IO operations have to be repeated because they
> +are buried behind several layers of API. It's difficult to fix
> +without rewriting, or duplicating, a number of svn_io_xxx and
> +svn_wc_xxx functions. Aside from the repeated IO itself, each IO
> +operation also has to repeat the UTF-8 to native path conversion.
> +
> +The current entries caching makes things faster than in the past, but
> +has it's own problems. Most operations now cache the entire entries
> +hierarchy in memory which limits the size of the working copies that
> +can be handled. The problem is difficult to solve as some operations
> +make multiple passes--commit for instance makes a first pass searching
> +for modifications, a second pass reporting to the repository, and a
> +third pass to do post-commit processing.
> +
> +The original code also did not always make a distinction between the
> +versioned hierarchy in the entries file and the physical hierarchy on
> +disk. Things like using stat() or svn_io_check_path() calls to
> +determine whether an item was versioned as file or directory do not
> +work when the working copy on disk is obstructed or incomplete.
> +
> +The Future
> +==========
> +
> +Some of these ideas are trivial, some of them are difficult to
> +implement, some of them may not work at all.
> +
> +- Have an svn_wc_t context object, opaque outside the library, that
> + would replace the access batons. This would get passed through most
> + of the libsvn_wc functions and could read/cache the entries files on
> + demand as the working copy was traversed. It could also cache the
> + UTF-8 xlate handle.
> +
> +- Have an API to svn_wc_entry_t, perhaps make the struct opaque, so
> + that things like URL need not be constructed when the entries file
> + is read but can be created on demand if required and possibly cached
> + once created. The aim would be to reduce the memory used by the
> + entries cache.
> +
> +- Consider caching physical IO results in svn_wc_entry_t/svn_wc_t.
> + Should we really stat() any file more than once? This becomes less
> + important as we reduce the number of IO operations.
> +
> +- Consider caching UTF-8 to native path conversions either in
> + svn_wc_t, or svn_wc_entry_t, or locally in functions and using
> + svn_io_xxx equivalents that accept native paths. This becomes less
> + important as we reduce the number of IO operations.
> +
> +- Make interfaces pass svn_wc_entry_t* rather than simple paths. The
> + public API using const char* paths would remain to be used by
> + libsvn_client et al.
> +
> +- Maintain a clear distinction between the versioned hierarchy and the
> + physical hierarchy when writing code, it's usually a mistake to use
> + one when the other should be used. To this end, audit the use of
> + svn_io_check_path().
> +
> +- Avoid using stat() to determine if an item is present on disk before
> + using the item, just use it straight away and handle the error if it
> + doesn't exist.
> +
> +- Search out and destroy functions that read and discard entries files
> + e.g. the apparently "simple" functions like svn_wc_is_wc_root or
> + check_wc_root. Such overhead is expensive when used by operations
> + that are not going to do much other work, running status on a single
> + file for example. The overhead may not matter to a command line
> + client, but it can matter to a GUI that makes many such calls.
> +
> +- Get rid of .svn/README.txt and .svn/format. Put the version
> + information into .svn/entries instead, either in the xml or just as
> + text on the first line. Reading one file rather then two will put
> + less load on the physical filesystem. Consider supporting out of
> + tree .svn directories. Is the size of .svn/entries critical? Would
> + a custom, less verbose, format be better than XML?
> +
> +- In the present code most operations are IO bound and have CPU to
> + spare. Perhaps compressed text-bases would make things faster
> + rather than slower, by trading spare CPU for reduced IO?
> +
> +- Keep track of the last text/prop time written into an entries file
> + and store it in svn_wc_t. Then when we come to do a timestamp sleep
> + we can do it from the that time rather than the current time.
> +
> +- Optimise property handling
> + http://svn.haxx.se/dev/archive-2004-11/0318.shtml
> +
> + * When an item has no props dispense with the base and working
> + property files. When an item has no property modifications
> + dispense with the working property file. These changes should make
> + it possible for svn_wc_props_modified_p to use a single stat()
> + when there are no property mods.
> +
> + * It might be possible to replace prop-time in the entries file
> + with a props-modified flag and so have svn_wc_props_modified_p
> + avoid all disk IO.
> +
> + * Consider caching svn:special in the entries file instead of (or as
> + well as?) the properties file. Status calls svn_wc__get_special
> + for nearly every file, even unmodified ones, and that reads the
> + properties file. If the above property handling optimisations
> + mean that svn_wc_props_modified_p usually avoids all disk IO then
> + svn_wc__get_special could become a major bottleneck.
> +
> +- The svn_wc_t context could work in conjunction with a more advanced
> + svn_wc_crawl_revisions system. This would provide a way of plugging
> + multiple callbacks into a queue, probably with some sort of ordering
> + and filtering ability, the aim being to replace most/all of the
> + existing explicit loops. This would put more of the pool handling in
> + one central location, it may even be possible to provide different
> + entry caching schemes. I don't know how practical this idea is, or
> + even if it is desirable.
> +
> +- Have a .svn/deleted directory so that schedule delete directories
> + can be moved out of the working copy. At present a skeleton hierarchy
> + of schedule delete directories remains in the working copy until the
> + delete is committed.
> +
> +- When handling a delete received during update/switch perhaps do it
> + in two stages. First move the item into a holding area within .svn
> + and finally delete all such items at the end of the update. This
> + would allow adds-with-history to use the deleted item and so might
> + be a way to handle moves (implemented as delete plus add) in the
> + presence of local modifications. Thought would have to be given to
> + the revision of the local deleted item, what happens if it doesn't
> + match the copyfrom revision? Perhaps we could get diffs, rather
> + than full text, for adds-with-history if the copyfrom source is
> + reported to the repository?
> +
> +- Consider writing some libsvn_wc compiled C regression tests to allow
> + more complete coverage. Most of the current libsvn_wc testing is
> + done via the command line client and it can be hard to get a working
> + copy into the state necessary to test all code paths.
> +
> +- There are some basic features that are fragile. Switch has some
> + bugs that can break a working copy, see issue 1906. I don't know
> + how the system is supposed to work in theory, let alone how it
> + should be implemented. Non-recursive checkout is broken, see issue
> + 695; this probably applies to non-recursive update and switch as well.
> +
> +- Use absolute paths within libsvn_wc so that "." is not automatically
> + a wc root.
> +
> +- Read notes/entries-caching for some details of the logging/caching
> + in the current libsvn_wc. It's important that writing the entries
> + file is handled efficiently.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: svn-unsubscribe@subversion.tigris.org
> For additional commands, e-mail: svn-help@subversion.tigris.org

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Mon Nov 22 20:09:22 2004

This message: [ Message body ]
Next message: C. Michael Pilato: "Re: svn commit: r11930 - in branches/locking/subversion: include libsvn_fs_base"
Previous message: Philip Martin: "Re: danger / bug in moving externals"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]