Re: [PATCH]: Revised cvs2svn.py patches

From: Greg Stein <gstein_at_lyra.org>
Date: 2002-02-12 04:56:15 CET

On Fri, Feb 08, 2002 at 01:25:11AM -0500, Daniel Berlin wrote:
> In order to make Greg's life easier in terms of extracting pieces of the
> patch, i cut the cvs parser down to only store the author and revision
> log, and removed the file format change to store the author (since we get
> it elsewhere now).

Ah. Very good. Yes, this patch is much cleaner. However, I haven't committed
it in its exact form, so I'll do a detailed review of this to let you know
what I kept, what I skipped, and what I did a bit differently.

The changes are pretty big, but one fact remains: you led the way on getting
the commits to occur, and I'd like to give a big THANKS! for that.

As a test case, I converted the mod_dav CVS repository in under 10 minutes
(480 distinct commits). It looks like there is a still some minor issues
with determining the groups of a commit, but that should be fixable. The
bulk of the logic to actually commit a transaction appears to be working.

[ I still want to write a script to extract each SVN commit, extract each
CVS tree at a date, and compare the two trees ]

[ to fix the commit groupings, I'm going to ensure each commit only includes
a file once; if a file occurs a second time, then we will need to slice
the group into one or more groups. ]

> The only functionality change from last time is that it also now sets the
> date on the commits properly (since it gets overwritten when we commit
> the transaction, we have to go and change it afterwards).

Cool.

>...
> +++ ./cvs2svn.py Thu Feb 7 21:38:23 2002
> @@ -2,8 +2,10 @@
> #
> # cvs2svn: ...
> #
> -
> +import bisect
> +import statcache
> import rcsparse
> +import sink
> import os
> import sys
> import sha

Kept statcache, tossed the others.

> @@ -12,8 +14,12 @@
> import fileinput
> import string
> import getopt
> -
> -
> +import Cache
> +from svn import fs, _util, _delta
> +_util.apr_initialize()
> +pool = _util.svn_pool_create(None)
> +fspool = _util.svn_pool_create(None)
> +logcache = Cache.Cache(size=50)

The Cache module wasn't in this patch. I've dumped everything related to the
log cache for now.

The svn imports are still there (although, it is now "util"), and I also
grab the _repos module for build the target repository.

The pool handling is now done differently, and I use the new run_app()
function from svn.util.

>...
>
>
>
>
>
>
>
>
>
>
>
>
>
> -
> -
> +
> +
>
>
>
>
>
>
>
> -
> +
>
>
>
>
>
>
>
> -
> +
> +
@@ -26,7 +32,7 @@ SVNROOT = 'svnroot' ATTIC = os.sep + 'Attic' -COMMIT_THRESHOLD = 5 * 60 # flush a commit if a 5 minute gap occurs +COMMIT_THRESHOLD = 5 * 60 # flush a commit if a 5 minute gap occurs OP_DELETE = 'D' OP_CHANGE = 'C' @@ -99,8 +105,8 @@ # shove the previous revision back in time (and any before it that # may need to shift). while t_p >= t_c: self.rev_data[prev][0] = t_c - 1 # new timestamp self.rev_data[prev][3] = t_p # old timestamp self.rev_data[prev][0] = t_c - 1 # new timestamp self.rev_data[prev][3] = t_p # old timestamp print 'RESYNC: %s (%s) : old time="%s" new time="%s"' \ % (relative_name(self.cvsroot, self.fname), @@ -110,7 +116,7 @@ prev = self.prev[current] if not prev: break t_c = t_c - 1 # self.rev_data[current][0] t_c = t_c - 1 # self.rev_data[current][0] t_p = self.rev_data[prev][0] # break from the for-loop @@ -129,7 +135,8 @@ self.resync.write('%08lx %s %08lx\n' % (old_ts, digest, timestamp)) self.revs.write('%08lx %s %s %s %s\n' % (timestamp, digest, op, revision, self.fname)) op, revision, self.fname))

All the above changes are simply whitespace formatting; I ignored them.
Please try to keep whitespace changes out of a functional patch, cuz it just
makes it harder to find the "meat" of the patch.

[ our typical practice is that when we do want to change whitespace or
  reflow comments or whatever, we make a different patch that only does
  whitespace; of course, that's just a general rule :-) we'll also do
  whitespace changes during some commits, but large changes are frowned
  upon... ]

>...
> @@ -140,7 +147,7 @@
> return l
>
> def visit_file(arg, dirname, files):
> - cd, p, stats = arg
> + cd, stats = arg
> for fname in files:
> if fname[-2:] != ',v':
> continue
> @@ -151,15 +158,59 @@
> cd.set_fname(pathname)
> if verbose:
> print pathname
> - p.parse(open(pathname), cd)
> + rcsparse.Parser().parse(open(pathname), cd)
> stats[0] = stats[0] + 1

I didn't see the point of this change, so I left it out.

I'm guessing it is to make it easier to substitute different parsers in
there. Lucas Bruand just made some change in the ViewCVS repository that
might help with parser substitution (although I haven't reviewed what he has
done yet).

> +class RevInfoParser(rcsparse.Sink):
> +
> + def __init__(self):
> + self.Reset()
> + def Reset(self):
> + self.revision_author = {}
> + self.revision_log = {}
> + def define_revision(self, revision, timestamp, author, state,
> + branches, next):
> + # save author
> + self.revision_author[revision] = author
> +
> +
> + # Construct associative arrays containing info about individual revisions.
> + #
> + # The following associative arrays are created, keyed by revision number:
> + # revision_log -- log message
> + def set_revision_info(self, revision, log, text):
> + self.revision_log[revision] = log

I collapsed the Reset() into the __init__ method. Simplified the instance
variables to just 'authors' and 'logs'. Removed the extraneous comments, and
added a bit of whitespace.

> + def parse_cvs_file(self, rcs_pathname, opt_rev = None, opt_m_timestamp = None):
> + # Args in: opt_rev - requested revision
> + # opt_m - time since modified
> + # Args out: revision_map
> + # timestamp
> + # revision_deltatext
> +
> + # CheckHidden(rcs_pathname);

Removed the unused params, and tossed these comments.

> + try:
> + rcsfile = open(rcs_pathname, 'r')
> + except:
> + try:
> + rcs_pathname = os.path.join(os.path.split(rcs_pathname)[0],
> + "Attic", os.path.split(rcs_pathname)[1])

Simplified the above construction a bit by doing the split() once, and
assigning to dirname, fname. Then reassembling.

>...
> class BuildRevision(rcsparse.Sink):
> def __init__(self, rev, get_metadata=0):
> self.rev = rev
> self.get_metadata = get_metadata
> self.result = None
> -
> + self.prev_delta = {}
> + self.d_command = re.compile("^d(\d+)\s+(\d+)")
> + self.a_command = re.compile("^a(\d+)\s+(\d+)")

Skipped. This class isn't used right now.

I'll fold this into a secondary patch, just for future's sake.

>...
> @@ -177,7 +228,6 @@
> revision = self.prev_delta.get(revision)
> path.reverse()
> self.collect = path
> -
> def set_revision_info(self, revision, log, text):

Ignored the whitespace change.

>...
> @@ -200,7 +250,7 @@
> else:
> adjust = 0
> diffs = string.split(text, '\n')
> -
> + add_lines_remaining = 0
> for command in diffs:

Unused class; ignored this for now.

>...
> @@ -223,6 +273,7 @@
> count = string.atoi(amatch.group(2))
> add_lines_remaining = count
> else:
> + print "Diff commands:%s Current: %s" % (diffs, command)
> raise RuntimeError, 'Error parsing diff commands'

Some debug stuff? Ignored for now.

>...
> @@ -240,20 +291,117 @@
> self.t_max = t
>
> if op == OP_CHANGE:
> - self.changes.append((file, rev))
> + self.changes.append((file[0:-2], rev))
> else:
> # OP_DELETE
> - self.deletes.append((file, rev))
> + self.deletes.append((file[0:-2], rev))

I kept the ,v in the file name, and handle its removal later, when I
generate a repository path for a given ,v file.

> def commit(self):

This method is considerably revamped. It now takes a target fs object (t_fs)
and the convertion context (ctx). The pools that were created above, are now
created, cleared, and destroyed within this method only (rather than as
globals). I also handle the author/log outside of the per-file loop since
they are (by definition) the same for an entire commit. Also, I ignore the
first assignment of the timestamp, and just do it once, post-commit.

> # commit this transaction
> print 'committing: %s, over %d seconds' % (time.ctime(self.t_min),
> self.t_max - self.t_min)
> + rev = fs.youngest_rev(fsob, pool)
> + txn = fs.begin_txn(fsob, rev, pool)
> +
> + root = fs.txn_root(txn, pool)

This all references the new pool handling.

> + lastcommit = (None, None)
> for f, r in self.changes:
> print ' changing %s : %s' % (r, f)
> + ps = os.path.split(f)[0]
> + ps = string.split(ps,os.sep)
> + for i in xrange(1, len(ps)+1):
> + if (fs.check_path(root, string.join(ps[0:i],os.sep), pool) == 0):
> + print "Making dir %s" % string.join(ps[0:i],os.sep)
> + fs.make_dir(root, string.join(ps[0:i],os.sep), pool)
> + repofilepath = f

I've revamped the computation of the repository path. I'm guessing that your
CVS repository was relative or something, and could be used for repos paths.
But that doesn't always hold (mine was ../cvsroots/mod_dav for my test), so
I use the relative_name() function to compute a repos path.

The directory construction is also simplified a bit, and I avoid checking
for creating the '/' path.

And the variable is now named 'repos_path' to match a similar name that I
use in mod_dav_svn.

> + if (fs.check_path(root, f, pool) == 0):
> + justmadefile = 1
> + fs.make_file(root, f, pool)
> + else:
> + justmadefile = 0

Same, but with a new pool variable, and the variable is 'created_file'

> + handler, baton = fs.apply_textdelta(root, f, pool)
> +
> + f = f + ",v"

Not needed, since I keep the ,v on there.

> +
> + # See if we have a revision and author log for this file in the cache
> + # Otherwise, parse the file with the cvs parser and recache the
> + # log.
> + temptuple = logcache.get(f)
> + if temptuple is None:
> + cvp = RevInfoParser()
> + cvp.parse_cvs_file (f)
> + logcache[f] = (cvp.revision_log, cvp.revision_author)
> + revlog = cvp.revision_log
> + authorlog = cvp.revision_author
> + del cvp
> + else:
> + revlog, authorlog = temptuple

The above code is all gone now. Rather than once per file, at the end of the
commit processing, I call a new function (get_metadata) to fetch the author,
log message, and formatted date. At the moment, there is no caching, but we
can add it to that one method.

(it's nice and localized now, which is Goodness(tm), IMO)

> + # Get the real file path to give to co
> + try:
> + statcache.stat (f)
> + except:
> + f = os.path.join(os.path.split(f)[0], "Attic", os.path.split(f)[1])
> + statcache.stat (f)

Kept, but I do the split() just once now. Also dropped the
space-before-paren format style.

> + # If we just made the file, we can just send a string for the new file,
> + # rather than streaming it.
> + if justmadefile:
> + _delta.svn_txdelta_send_string(os.popen("co -q -p%s %s" %(r, f), "r", 102400).read(), handler, baton, pool)
> + else:
> + # Open the pipe to co
> + infile = os.popen("co -q -p%s %s" % (r, f), "r", 102400)

I open the pipe once, and then use it differently in the two branches.

> + # Open a SVN stream for that pipe
> + stream2 = _util.svn_stream_from_stdio (infile, pool)
> +
> + # Get the current file contents from the repo, or,
> + # if we have multiple CVS revisions to the same file
> + # being done in this single commit, then get the
> + # contents of the previous revision from co, or
> + # else the delta won't be correct because the contents
> + # in the repo won't have changed yet.
> + if repofilepath == lastcommit[0]:
> + infile2 = os.popen("co -q -p%s %s" % (lastcommit[1], f), "r", 102400)
> + stream1 = _util.svn_stream_from_stdio (infile2, pool)
> + else:
> + stream1 = fs.file_contents (root, repofilepath, pool)
> + txstream = _delta.svn_txdelta(stream1, stream2, pool)
> + _delta.svn_txdelta_send_txstream (txstream, handler, baton, pool)

All the same, but with a different pool now.

> + _util.svn_stream_close (stream2)
> + infile.close()
> + if repofilepath == lastcommit[0]:
> + infile2.close()

I avoid most of this. For the pipes, I still close them, but the streams and
the delta stream just go away with the per-file pool.

> + # We might as well reset the properties on every change
> + # for right now
> + fs.change_txn_prop (txn, "svn:log", revlog[r], pool)
> + fs.change_txn_prop (txn, "svn:author", authorlog[r], pool)
> + fs.change_txn_prop (txn, "svn:date", _util.svn_time_to_nts(_util.apr_ansi_time_to_apr_time(self.t_max)[1], pool),pool)

Done just once now, and the data comes from get_metadata()

> + lastcommit = (repofilepath, r)
> +
> for f, r in self.deletes:
> print ' deleting %s : %s' % (r, f)
>
> + # If the file was initially added on a branch, the first mainline
> + # revision will be marked dead, and thus, attempts to delete it will
> + # fail, since it doesn't really exist.
> + if (r != "1.1"):
> + fs.delete(root, f, pool)

All kept, but I use the newer repos_path construction for this.

> + conflicts, new_rev = fs.commit_txn(txn)
> + newdate = _util.svn_time_to_nts(_util.apr_ansi_time_to_apr_time(self.t_max)[1], pool)
> + fs.change_rev_prop (fsob, new_rev, "svn:date", newdate, pool)

This is still the same, although the date comes from self.get_metadata().

> +
> + # If we don't clear the pool, we'll continually eat up memory.
> + # This pool only contains objects it's okay to delete. The fs object is
> + # in a different pool.
> + _util.svn_pool_clear (pool)

Done slightly differently.

> +
> + if conflicts:
> + print 'conflicts:', conflicts
> + print 'New revision:', new_rev

The same, but I'm finding that the conflicts come back as '\n'. Probably
some weird 'join' behavior somewhere.

>...
> @@ -300,9 +448,8 @@
>
> def pass1(ctx):
> cd = CollectData(ctx.cvsroot, DATAFILE)
> - p = rcsparse.Parser()
> stats = [ 0 ]
> - os.path.walk(ctx.cvsroot, visit_file, (cd, p, stats))
> + os.path.walk(ctx.cvsroot, visit_file, (cd, stats))

Per the comments for visit_file(), I skipped this change.

>...
> @@ -357,15 +504,16 @@
> # process the logfiles, creating the target
> commits = { }
> count = 0
> -
> for line in fileinput.FileInput(ctx.log_fname_base + SORTED_REVS_SUFFIX):

Skipped whitespace change.

> timestamp, id, op, rev, fname = parse_revs_line(line)
> -
> - if commits.has_key(id):
> - c = commits[id]
> - else:
> - c = commits[id] = Commit()
> - c.add(timestamp, op, fname, rev)
> + # Only handle trunk revision commits until branch handling is
> + # finished in the committer
> + if (trunk_rev.match(rev)):

I used a 'continue' here, but this is otherwise the same.

>...
> @@ -373,12 +521,23 @@
> if c.t_max + COMMIT_THRESHOLD < timestamp:
> process.append((c.t_max, c))
> del commits[id]
> -
> process.sort()

Skipped whitespace change.

> for t_max, c in process:
> c.commit()
> count = count + len(process)
>
> + # I have a repository with all commits occurring within the
> + # first 5 minutes. Thus, none of the commits will be processed
> + # since c.t_max + COMMIT_THRESHOLD is always > timestamp
> + # Check for this by seeing if some commits are still in the commits
> + # list, and if so, commit them
> + if (len(commits) != 0):
> + for id, c in commits.items():
> + process.append((c.t_max, c))
> + process.sort()
> + for t_max, c in process:
> + c.commit()
> + count = count + len(process)

Kept all this, but with a new comment. Initialized 'process' before shoving
data into it.

>...
> @@ -417,11 +576,14 @@
> print ' total:', int(times[len(_passes)] - times[start_pass-1]), 'seconds'
>
> def usage():
> - print 'USAGE: %s [-p pass] repository-path' % sys.argv[0]
> + print 'USAGE: %s [-p pass] [ -h db home ] repository-path' % sys.argv[0]

I ignored the extra switch for now. It always creates a repository at
'./svnroot' right now.

Note that it *creates*. The intent is to convert an entire CVS repository
into a new SVN repository. Not sure there is a use case for doing that into
an existing repository.

> sys.exit(1)
>
> def main():
> - opts, args = getopt.getopt(sys.argv[1:], 'p:v')
> + global fsob
> + db_path = os.curdir
> + _util.apr_initialize()

The fs object is created in the pass4() function now, along with the
repository (and it will fail if the repos exists). The apr_initialize() is
handled by run_app() now.

> + opts, args = getopt.getopt(sys.argv[1:], 'p:h:v')

Not doing this (yet?).

> if len(args) != 1:
> usage()
> verbose = 0
> @@ -435,7 +597,18 @@
> sys.exit(1)
> elif opt == '-v':
> verbose = 1
> + elif opt == '-h':
> + home = value
> + db_path = os.path.join(home, 'db')
> + if not os.path.exists(db_path):
> + db_path = home

Same.

> + fsob = fs.new(fspool)
> + fs.open_berkeley(fsob, db_path)
> convert(args[0], start_pass=start_pass, verbose=verbose)
> + fs.close_fs(fsob)
> + _util.apr_terminate()

The app handling is a bit different now.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Received on Sat Oct 21 14:37:06 2006

This message: [ Message body ]
Next message: Greg Stein: "config/tarball status (was: [PATCH]: Add bindings for SVN/APR time functions)"
Previous message: Blair Zajac: "Re: [PATCH] echo -n is not portable and used too much"
In reply to: Daniel Berlin: "[PATCH]: Revised cvs2svn.py patches"
Next in thread: Daniel Berlin: "Re: [PATCH]: Revised cvs2svn.py patches"
Reply: Daniel Berlin: "Re: [PATCH]: Revised cvs2svn.py patches"
Reply: Daniel Berlin: "Re: [PATCH]: Revised cvs2svn.py patches"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]