[Arising from some discussion on IRC today.]
I've been considering the problem of a dump/load upgrade for a
repository with a large number of revisions. To minimise downtime the
initial dump/load would be carried out while the original repository
remains live. When the load finishes the new repository is already
out-of-date so an incremental dump/load is carried out. When this
second load finishes the original repository is taken offline and we
want to bring the new repository online as quickly as possible. A final
incremental dump/load is required but that only involves a small number
of revisions and so is fast. The remaining problems are locks and
revprops.
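
The repeated passes described above can be sketched as follows. This is only an illustration: the helper name incremental_pass and the example paths are hypothetical, and it assumes the youngest revision of each repository has already been obtained (for example with svnlook youngest). It only builds the command lines, it does not run them.

```python
def incremental_pass(src, dst, dst_youngest, src_youngest):
    """Build the dump and load command lines for one catch-up pass.

    Returns None when dst already has every revision that src has,
    otherwise a (dump_argv, load_argv) pair whose output/input are
    meant to be connected with a pipe.
    """
    if dst_youngest >= src_youngest:
        return None
    lower = dst_youngest + 1
    dump_argv = ["svnadmin", "dump", "-q", "--incremental",
                 "-r", "%d:%d" % (lower, src_youngest), src]
    load_argv = ["svnadmin", "load", "-q", dst]
    return dump_argv, load_argv


if __name__ == "__main__":
    # Hypothetical state: dst was loaded up to r100000 while src moved on.
    commands = incremental_pass("/srv/old", "/srv/new", 100000, 100042)
    if commands is not None:
        dump_argv, load_argv = commands
        print(" ".join(dump_argv) + " | " + " ".join(load_argv))
```

Each pass is shorter than the previous one, so only the last pass, run after the original repository is taken offline, contributes to the downtime.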
We do not have tools to handle locks so the options are: a) drop all the
locks, or b) copy/move the whole db/locks subdir. I'm not really
interested in locks at present.
Revprops are more of a problem. Most revprops are up-to-date but a
small number may be out-of-date. The problem is we do not know which
revprops are out-of-date. Is there a reliable and efficient way to
bring the revprops up-to-date? We could attempt to disable and/or track
revprop changes during the load but this is not reliable. Post-hooks
are not 100% reliable and revprop changes can bypass the hooks. We
could attempt to copy/move the whole revprops subdir, but that is not
always possible if the repository formats are different.
One general solution is to use svnsync to bulk copy the revprops:

    ln -sf /bin/true dst/hooks/pre-revprop-change
    svnsync initialize --allow-non-empty file:///src file:///dst
    svnsync copy-revprops file:///src file:///dst

This isn't very fast: I get about 2,000 revisions a minute for
repositories on an SSD. There are typically three revprops per
revision and the FS/RA APIs change one at a time. Each change must run
the mandatory pre-revprop-change hook and fsync() the repository.
svnsync has a simple algorithm that writes every revprop for each
revision.
For a repository with a million revisions svnsync would invoke three
million processes to run the hooks and make three million fsync()
calls. Typically, most
of this work is useless because most of the revprops already match.
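
To put the figures above together, here is the arithmetic as a small sketch. It uses only the numbers quoted in this mail (a million revisions, three revprops per revision, about 2,000 revisions a minute) and assumes the svnsync rate stays constant over the whole run.

```python
revisions = 1000000
revprops_per_revision = 3        # typical figure quoted above
svnsync_revs_per_minute = 2000   # rate measured on an SSD, quoted above

# One hook process and one fsync() per revprop written.
hook_invocations = revisions * revprops_per_revision
svnsync_minutes = revisions / float(svnsync_revs_per_minute)

print("hook invocations: %d" % hook_invocations)
print("svnsync copy-revprops: %.0f minutes (%.1f hours)"
      % (svnsync_minutes, svnsync_minutes / 60.0))
```

That works out to about 500 minutes, a little over eight hours, and it is spent while the repository is offline.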
I wrote a script using the Python FS bindings (see below). This avoids
the hooks and also elides the writes when the values already match.
Typically this just has to read and so will process several hundred
thousand revisions a minute. This will reliably update a million
revisions in minutes.
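
The per-revision comparison is the part that saves the work, so here it is sketched as a pure function on plain dicts, independent of the FS bindings. The name revprop_changes is hypothetical; a value of None stands for "delete this property", matching what fs.change_rev_prop expects.

```python
def revprop_changes(src_props, dst_props):
    """Return the (name, value) writes needed to make dst match src.

    An empty list means the revision is already up-to-date and
    nothing needs to be written.
    """
    changes = []
    for name, value in src_props.items():
        # Missing in dst (add) or present with a different value (modify).
        if name not in dst_props or dst_props[name] != value:
            changes.append((name, value))
    for name in dst_props:
        # Present only in dst: it was deleted from src.
        if name not in src_props:
            changes.append((name, None))
    return changes


if __name__ == "__main__":
    src = {"svn:author": "alice", "svn:log": "Fix typo in log message."}
    dst = {"svn:author": "alice", "svn:log": "Fix tpyo.", "extra": "x"}
    print(sorted(revprop_changes(src, dst), key=lambda c: c[0]))
```

For the common case where the two property lists are equal the function returns an empty list, so the revision costs two reads and no writes.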
I was thinking that perhaps we ought to provide a more accessible way to
do this. First, modify the FS implementations to detect when a change
is a noop that doesn't modify a value and skip all the writing. Second,
provide some new admin commands to dump/load revprops:

    svnadmin dump-revprops repo | svnadmin load-revprops repo

dump-revprops would dump just the revprops and load-revprops would load
into existing revisions rather than creating new revisions. There would
be options to enable/bypass the hooks. I think this would be close to the
efficiency of the script.

#!/usr/bin/python

import sys
from svn import core, fs, repos

src_path = core.svn_path_canonicalize(sys.argv[1])
dst_path = core.svn_path_canonicalize(sys.argv[2])

src_repo = repos.open(src_path)
dst_repo = repos.open(dst_path)
src_fs = repos.fs(src_repo)
dst_fs = repos.fs(dst_repo)

# Only revisions present in both repositories can be compared.
head = min(fs.youngest_rev(src_fs), fs.youngest_rev(dst_fs))

for r in range(0, head + 1):
    print(r)
    src_props = fs.revision_proplist(src_fs, r)
    dst_props = fs.revision_proplist(dst_fs, r)
    for src_name, src_value in src_props.items():
        try:
            dst_value = dst_props[src_name]
        except KeyError:
            fs.change_rev_prop(dst_fs, r, src_name, src_value) # add
        else:
            # Write only when the value really differs.
            if src_value != dst_value:
                fs.change_rev_prop(dst_fs, r, src_name, src_value) # modify
    # Properties that exist only in the destination are removed.
    for dst_name in dst_props:
        if dst_name not in src_props:
            fs.change_rev_prop(dst_fs, r, dst_name, None) # delete

--
Philip Martin
WANdisco
Received on 2015-07-24 21:58:46 CEST