[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Improving the performance of libsvn_wc for checkouts/updates/switches

From: Josh Pieper <jjp_at_pobox.com>
Date: 2004-05-22 15:11:28 CEST

Since ghudson is working on turning a O(n^2) import into O(n), I
thought I would look at what in the working copy library was causing
O(n^2) behavior for checkouts and updates. This mail is an attempt to
show my solution to the problem and make sure that I am not violating
any of the guarantees we want the working copy library to make. Let
me know if something isn't right, or can be improved.

First, why the n^2 behavior is there: During a checkout, the update
editor receives each new file addition one by one. For each addition,
it writes out a short log file that describes the changes necessary to
incorporate the one file, then it runs that short log file. The log
file causes the entire entries file to be read in, changes to be made,
and then the entire entries file to be written back out. Since
reading and writing the entries file takes time proportional to the
number of files in the directory, the cost to add a new file is also
proportional to the number of files already in the directory, thus
causing O(n^2).

My solution:

Step 1:
The first step was to make the update editor create one big log file
for each directory, running this log file when the directory is closed
(deletes have the log file run immediately, so that updates that
delete then add work). This involved opening a new log file when the
directory is opened and removing all the open and close log file code
from the beginning and end of each individual operation. This way
when the log file is eventually run, all the information needed to
construct the final entries file is available in one place, but all
the non-written to disk changes are still in the log file in the event
of failure.

Step 2:
Make the log file execution code read the entries file once at the
beginning of the log file, then make all subsequent changes to an in
memory hash. The entries file is written out only once, immediately
before the log file is completed. This involved making new versions
of svn_wc__entry_modify and svn_wc__get_keywords that would use the
in-memory entries hash.

Pros:
 * O(n) behavior when adding/modifying files using the update editor.

Cons:
 * A lot more temporary disk space is used as the log file for an
   entire directory is created.
   
 * An update of all deletes will still be n^2.

 * If log file execution is interrupted, none of the entries will have
   been written to disk. When cleanup happens, the entire log-file
   will have to be re-read and executed.

Performance:
There are still O(n^2) bits in libsvn_fs right now, so checkouts
aren't linear yet. However, this does still improve performance right
now by about a factor of 4. Below are some preliminary speed numbers
for a checkout of a large directory.

BDB @ trunk
# of files, wall clock time, (cpu time)
-----------------------------------------
100: 1.299 (0.244)
200: 1.824 (0.698)
400: 4.271 (2.317)
800: 10.761 (8.217)
1600: 37.004 (31.030)
3200: 2m22.65 (2m2.11)

BDB w/ O(n) changes
----------------------
100: 0.716 (0.113)
200: 1.112 (0.250)
400: 1.526 (0.618)
800: 4.013 (1.815)
1600: 8.273 (5.741)
3200: 24.638 (20.838)

-Josh

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sat May 22 15:12:04 2004

This is an archived mail posted to the Subversion Dev mailing list.