
Re: Every Version of Every File in a Repository

From: Alexey Neyman <stilor_at_att.net>
Date: Tue, 07 Oct 2014 17:36:04 -0700

On Wednesday, October 08, 2014 12:41:01 AM Branko Čibej wrote:
> On 07.10.2014 22:36, Andreas Mohr wrote:
> > Hi,
> >
> > That's certainly a somewhat tough one.
> >
> >
> > I will get tarred and feathered here for my way of trying to solve this,
> > and possibly even rightfully so, but... ;)
>
> Well, I certainly won't skin you alive for suggesting this; but ... I
> would imagine that "git svn fetch" has to essentially do just what the
> OP doesn't want to do, i.e., successively retrieve each revision of
> every file in the Subversion repository to populate the Git repository.
> There's not much chance this would be faster than just doing the same
> with Subversion, especially since, once you're done, you /still/ have to
> scan the files in the resulting Git repo.
>
>
> Going back to the original question ...
>
> > Aside from the brute-force method of checking out the entire repository
> > starting at revision 1, performing a scan, updating to the next
> > revision,
> > and repeating until I reach the head, I don’t know of a way to do this.
>
> This is, in fact, likely to be (almost) the most efficient way to do
> this, since you can just use the existing Subversion client to deal with
> the repository contents and version discrepancies.
>
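
For concreteness, here is a minimal sketch of that loop driven through the
command-line client - the repository URL is a placeholder, and scan_tree()
stands in for whatever scanning step is actually needed:

    #!/usr/bin/env python
    # Brute-force scan: check out r1, then update the working copy one
    # revision at a time up to HEAD, scanning it at each step.
    import re
    import subprocess

    URL = 'http://svn.example.com/repos/project/trunk'  # placeholder
    WC = 'scan-wc'

    def head_revision(url):
        # Pull the "Revision:" line out of "svn info" output.
        info = subprocess.check_output(['svn', 'info', url]).decode()
        return int(re.search(r'^Revision: (\d+)$', info, re.M).group(1))

    def scan_tree(path, revision):
        pass  # placeholder for the actual per-revision scan

    head = head_revision(URL)
    subprocess.check_call(['svn', 'checkout', '-q', '-r', '1', URL, WC])
    scan_tree(WC, 1)
    for rev in range(2, head + 1):
        subprocess.check_call(['svn', 'update', '-q', '-r', str(rev), WC])
        scan_tree(WC, rev)
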
> But there is an alternative that might be more efficient in your case:
> Create a dumpstream of the repository using "svnadmin dump",
> non-incremental and not using deltas, then pipe the stream to a custom
> tool that extracts the file contents from the stream and either writes them
> to disk, or passes them to your scanning tool in some other way.
>
> The reason why this could be faster than the checkout+repeated update is
> that you do not have the overhead of a working copy, directory tracking,
> property handling, etc. etc., and you can probably save on disk space by
> keeping the file contents around only as long as they're being scanned.
> It does mean that your custom tool will have to parse the dumpfile
> format, but that's really not so hard: the format is quite simple, and
> there are a number of example scripts that do that in our repository.
> Another alternative is to use our API directly, possibly through one of
> the bindings, to get file contents straight from the repository; but I
> suspect it's harder than parsing the dump file.
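
For what it's worth, the pipeline itself can be about this small - the
repository path is a placeholder, and scan-dump.py is the parser script
sketched later in this mail:

    # Stream a full (non-incremental, non-deltas) dump straight into a
    # custom parser, with no working copy at all.  Shell equivalent:
    #   svnadmin dump --quiet /path/to/repo | python scan-dump.py
    import subprocess

    dump = subprocess.Popen(['svnadmin', 'dump', '--quiet', '/path/to/repo'],
                            stdout=subprocess.PIPE)
    subprocess.check_call(['python', 'scan-dump.py'], stdin=dump.stdout)
    dump.stdout.close()
    dump.wait()

Because it is a pipe, the dump never has to land on disk in full, which is
where the disk-space saving comes from.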

The Python bindings for parsing the dumpstream currently do not work, as I described on
svn-dev@ some time ago: the layer that "thunks" the C callbacks back into Python code is
not implemented right now. As far as I can see, the Perl/Ruby bindings have the same
problem.

On top of that, creating a stream in Python does not seem to work either - see the email I
just sent to svn-dev@ a few minutes ago. Ironically, I discovered that while trying to test
the implementation of this "thunking" code for parsing the dumpstream :) Not sure if this
affects Perl/Ruby.

So, back to your advice - it's either using the C library directly, or implementing a parser
for the stream. Which isn't hard, I admit.
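
To put that in concrete terms, here is roughly what such a parser looks like -
a sketch only, assuming a full (non-deltas, non-incremental) format-2 dump,
with all error handling omitted:

    #!/usr/bin/env python
    # Minimal dumpstream parser: read "Key: value" header blocks, skip
    # the property sections, and hand each node's full text to a callback.
    import sys

    def read_headers(stream):
        # A record is a block of "Key: value" lines ended by a blank line.
        headers = {}
        while True:
            line = stream.readline()
            if not line:                     # end of stream
                return headers or None
            line = line.rstrip('\n')
            if not line:                     # blank line closes the block
                if headers:
                    return headers
                continue                     # padding between records
            key, _, value = line.partition(': ')
            headers[key] = value

    def parse_dumpstream(stream, scan):
        revision = None
        while True:
            headers = read_headers(stream)
            if headers is None:
                break
            if 'Revision-number' in headers:
                revision = int(headers['Revision-number'])
            # Properties precede the text inside a record's content; both
            # are byte-counted, so they can be skipped or read exactly.
            stream.read(int(headers.get('Prop-content-length', 0)))
            if 'Node-path' in headers and 'Text-content-length' in headers:
                text = stream.read(int(headers['Text-content-length']))
                scan(headers['Node-path'], revision, text)

    if __name__ == '__main__':
        # Repositories with binary files want a binary-mode stream
        # (sys.stdin.buffer on Python 3); text mode keeps the sketch short.
        parse_dumpstream(sys.stdin, lambda path, rev, text:
                         sys.stderr.write('r%d %s\n' % (rev, path)))

Everything in the format is length-prefixed, so the parser never has to guess
where content ends. One caveat: copied nodes carry no text of their own in the
dump, but their contents have already appeared at the copy source, so nothing
is actually missed.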

Regards,
Alexey.
Received on 2014-10-08 02:36:41 CEST

This is an archived mail posted to the Subversion Users mailing list.
