Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

From: Karl Fogel <kfogel_at_newton.ch.collab.net>
Date: 2003-04-29 00:51:44 CEST

Greg Stein <gstein@lyra.org> writes:
> This string method usage requires Python 2.0, whereas cvs2svn.py used to
> only require Python 1.5.2. And with the future dropping of the bindings,
> then it will be even easier to only do 1.5.2 (while our bindings can work
> against 1.5.2, I'm not sure how many people try that).
>
> So... that would change to:
>
> path = string.lstrip(path, '/')

Ah, hmmm. I was wondering about that. The above usage is what I
tried first, but it gave an error in Python 2.2.2... What to do?

     $ python
     Python 2.2.2 (#1, Feb 19 2003, 12:12:23)
     [GCC 2.96 20000731 (Red Hat Linux 7.0)] on linux2
     Type "help", "copyright", "credits" or "license" for more information.
>>> import string
>>> path = "/path/to/somewhere"
>>> path = string.lstrip(path, '/')
     Traceback (most recent call last):
       File "<stdin>", line 1, in ?
     TypeError: lstrip() takes exactly 1 argument (2 given)
>>> path = path.lstrip('/')
>>> path
     'path/to/somewhere'
>>>
     $

> > + path_so_far = None
> > + components = string.split(path, '/')
>
> One of the things that I like to do is:
>
> components = filter(None, string.split(path, '/'))
>
> That filter() usage will filter out all empty strings, which means it
> ignores leading and trailing slashes, and double-slashes. Kinda nifty :-)

Ooooh, I like it.

> > + last_idx = len(components) - 1
> > + this_node = root
> > +
> > + i = 0
> > + while (i < last_idx):
> > +
> > + component = components[i]
>
> I don't think you need the "i" logic. Just do:
>
> for component in components:

That's what I started with, actually -- but I need to know when I hit
the component right *before* the last component. I wish loops had a
'next_to_last' keyword or something :-).

> > + if path_so_far:
> > + path_so_far += '/' + component
>
> The += construct is also Python 2.0 based.

Okay. Mike Pilato also pointed this out, but I didn't realize at the
time that we were within spitting distance of 1.5.2 compatibility.
I'll get rid of the "+=" stuff.

> > +def get_md5(path):
> > + """Return the hex md5 digest of file PATH."""
> > + f = open(path, 'r')
> > + checksum = md5.new()
> > + buf = f.read(102400)
> > + while buf:
> > + checksum.update(buf)
> > + buf = f.read(102400)
> > + f.close()
> > + return checksum.hexdigest()
>
> There isn't a real need to close the file, as it will happen when "f" goes
> out of scope.

Thanks.

> >...
> > + # Make the dumper's temp directory for this run. RCS working
> > + # files get checked out into here.
> > + os.mkdir(self.tmpdir)
>
> Couldn't you use 'co -p' and avoid temp files altogether?

Right now, it needs the checksum before it writes the file data to the
dumpfile (see the comment above that area), since the checksums go in
the headers. In the plan to backpatch the checksums, then yes, it
could become a stream again.

[Oh wait, I just saw that you address that issue a bit further down...]

> >...
> > + # Anything ending in ".1" is a new file.
> > + #
> > + # ### We could also use the parent_node to determine this.
> > + # ### Maybe we should, too, because ".1" is not perfectly
> > + # ### reliable, because of 'cvs commit -r'...
> > + if re.match('.*\\.1$', cvs_rev):
>
> Bah. Perl-style overkill :-)
>
> if cvs_rev[-2:] == '.1':

Heh. Will remember (although this logic may be about to go away).

> > + self.dumpfile.write('Node-path: %s\n' % svn_path)
> > + self.dumpfile.write('Node-kind: file\n')
> > + self.dumpfile.write('Node-action: %s\n' % action)
> > + self.dumpfile.write('Prop-content-length: %d\n' % 10) ### svn:executable?
> > + self.dumpfile.write('Text-content-length: %d\n' % os.path.getsize(working))
> > + self.dumpfile.write('Text-content-md5: %s\n' % get_md5(working))
> > + self.dumpfile.write('Content-length: %d\n' % 0) # todo
> > + self.dumpfile.write('\n')
> > + self.dumpfile.write('PROPS-END\n')
>
> Might be interesting to have a little helper function/object that writes out
> all this stuff, and can be shared between here and the Node class.

Yeah, not sure yet what level of factorization will be appropriate,
but will keep an eye out.

> Right. This plays into the use of 'co -p', too. Nominally, it would be great
> to be able to do:
>
> f = os.popen('co -p ...')
> checksum = md5.new()
> buf = f.read(CHUNK_SIZE)
> while buf:
> checksum.update(buf)
> dumpfile.write(buf)
> buf = f.read(CHUNK_SIZE)
> digest = checksum.hexdigest()
>
> The problem is going back. You can use dumpfile.tell() to get a seek
> position, and then do the backwards seek stuff. Another alternative would be
> taking a direction from HTTP and have the notion of "trailing" headers. The
> checksum would come in an RFC822-ish "header block" *after* the content.
>
> Oh, and your "patchup phase" wouldn't be that complicated. Just something
> like:
>
> dumpfile.write('Checksum: ')
> pos = dumpfile.tell()
> dumpfile.write(CHECKSUM_PLACEHOLDER + '\n')
> ...
> copy content
> ...
> dumpfile.seek(0, pos)
> dumpfile.write(checksum.hexdigest())
> dumpfile.seek(2, 0)
>
> You could also do:
>
> import FCNTL
> ...
> dumpfile.seek(FCNTL.SEEK_SET, ...
> ...
> dumpfile.seek(FCNTL.SEEK_END, ...
>
> rather than 0 or 2.

Zing! You have solved one of the two big problems, sir. Thank you,
I'll do the backpatching in real time.

The other problem is that Node tree. Because Subversion uses paths
for branches, the full size of the node tree is much greater than just
the tree one would get from

$ ls -R cvs_repository/

Rather, it's that repository tree multiplied (more or less) by the
number of unique branches and tags in the RCS files. That's biiig.
So farming the Node tree out to disk is not just a "nice to have"
anymore, it's a requirement for graduation :-).

(If you have any suggestions for a particularly Python-ish way to do
it, they are welcome of course. I was just going to use `anydbm' and
come up with some simple representation, perhaps involving pickling.)

Thanks for the quick review, especially so early in the rewrite,
-K

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Tue Apr 29 01:38:31 2003

This message: [ Message body ]
Next message: Greg Hudson: "Re: Compressed text-base patch"
Previous message: rbb_at_rkbloom.net: "Re: Compressed text-base patch"
In reply to: Greg Stein: "Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn"
Next in thread: Greg Stein: "Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn"
Reply: Greg Stein: "Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn"
Reply: Jeffrey C. Ollie: "Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]