Marko Macek <Marko.Macek@gmx.net> writes:
> I was trying to separate the actual commit step that creates the
> repository from the heuristics that generate the commit sets.
> It is also easier to test/develop things if one can diff the
Great! Yes, I totally agree.
> > The entry maps file paths to revision ranges
> > filepath --> (N M)
> > where N is the Subversion revision corresponding to the CVS revision
> > X.Y in which the branch was created for that filepath, and M is the
> > Subversion revision corresponding to X.(Y+1).
>
> I suggest we ignore the memory usage problem for now because in practice
> it's not a problem except for extremely large repositories. We should
> get the algorithm right first.
We can ignore it for a bit, but I know of several specific large
repositories out there that will want to convert. They will simply
grind to a halt if we do stuff like this in memory; we're talking
about a decade of revision history and thousands of branches/tags :-).
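For concreteness, here is a minimal sketch of the entry quoted above, held as a plain in-memory dict. It assumes (this is one reading of the description, not something the thread confirms) that a file can be copied to the branch from any Subversion revision r with N <= r < M, so the whole branch can be copied in one operation only from a revision inside every file's range. The name common_window is hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory form of the entry: filepath --> (N, M)
RevRange = Tuple[int, int]


def common_window(entry: Dict[str, RevRange]) -> Optional[RevRange]:
    """Return the half-open range [lo, hi) of Subversion revisions that
    lies inside every file's (N, M) range, or None if there is none.

    Assumption: a file is copyable from any revision r with N <= r < M.
    """
    if not entry:
        return None
    lo = max(n for n, _ in entry.values())   # latest branch point
    hi = min(m for _, m in entry.values())   # earliest follow-up revision
    return (lo, hi) if lo < hi else None


entry = {"src/main.c": (3, 9), "src/util.c": (5, 7)}
print(common_window(entry))
```

A dict like this is exactly what will not scale to the repositories mentioned above, which is the reason to think about an on-disk form sooner rather than later.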
> One thing to be aware of: for nested branches or tags on branches, the
> decision of when to copy the branch will affect the revisions above.
> It may be necessary to run the algorithm several times, for each level
> of tags. What complicates things is that tags (and therefore branches)
> can tag files of different branches.
Yes. We may have to get all branches first, then do all tags in a
separate pass?
I had some trouble understanding your algorithm description:
> But my current algorithm works differently, mainly because I wanted to
> avoid delete operations when doing a tag copy and also reduce memory
> usage. And because it is simpler (but in current implementation,
> pretty slow).
>
> It goes like this:
>
> pass4: go over the .s-revs file and make a set of CVS (f,r) for each
> tag/branch. Also separates tags from branches and determines the
> branch/tag dependencies (for now it incorrectly assumes it is a
> tree). This is kept in memory for now.
>
> pass5: Generates a commits file exactly like current cvs2svn. This
> should probably be careful not to combine files/revs that are only
> partially tagged, but it isn't yet.
>
> pass6: Read the commits file and keep track of the current repository
> state. After each commit, compare the state to all tags. We then have
> several cases:
>
> a) perfect match. Mark this tag as having a possible perfect copy.
> This handles things like files being and then removed.
"being and then removed" ... is the missing word "added"?
> b) partial match. We have a match which has some files, but no extra
> files. Remember the match if it's better than previous partial match.
Is "extra" really "all"? I guess I don't understand this part with
precision.
> c) no match. If there is no partial or perfect match, the tag will be
> copied file by file.
I understand step (c), but am unclear on the exact circumstances that
cause it to be invoked.
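To pin down the three cases, here is a sketch of one possible reading of pass6. It is a guess at the intended meaning, not Marko's code: "no extra files" is taken to mean that the repository state contains nothing the tag lacks, so a copy would need no deletes afterwards. The function name classify and the dict layout (file path to CVS revision string) are hypothetical.

```python
def classify(state, tag):
    """Compare the current repository state with a tag.

    Both arguments map file path -> CVS revision string.
    Returns (kind, score), where kind is "perfect", "partial" or "none"
    and score is the number of (file, revision) pairs that matched.
    """
    state_items = set(state.items())
    tag_items = set(tag.items())
    if state_items == tag_items:
        # a) every tagged (f, r) is present and nothing else is
        return "perfect", len(tag_items)
    if state_items and state_items <= tag_items:
        # b) some tagged files are present and there are no extra files,
        #    so a copy of this state would need no deletes
        return "partial", len(state_items)
    # c) the state has at least one file or revision the tag lacks
    return "none", 0


tag = {"a.c": "1.2", "b.c": "1.1"}
print(classify({"a.c": "1.2"}, tag))
print(classify({"a.c": "1.2", "b.c": "1.1"}, tag))
print(classify({"a.c": "1.3"}, tag))
```

Under this reading, step (c) applies to a tag only when no commit ever left the repository in a state that was a non-empty subset of the tag, and "better" in step (b) would mean a higher score.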
> b) Branch must be copied before there is any commit to it.
Yes :-).
> Also, it should be reasonably easy (but slower) to organize the files
> into a directory tree instead of a list. This would make it possible
> to handle subtree copies instead of falling back to file-by-file copies.
That sounds like a good idea.
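As a rough illustration of the subtree idea, the sketch below finds the topmost directories whose contents are identical in the repository state and in the tag, so that each could be copied as a unit. The helper names (files_under, subtree_copies) and the flat dict representation (file path to CVS revision) are assumptions made for the example, and it rescans the file list per directory, so it is deliberately simple rather than fast.

```python
import posixpath


def files_under(files, directory):
    """Return the subset of files (path -> revision) below directory."""
    prefix = directory + "/"
    return {p: r for p, r in files.items() if p.startswith(prefix)}


def subtree_copies(state, tag):
    """Return the topmost directories that match exactly in state and tag."""
    # Collect every directory that contains a tagged file.
    dirs = set()
    for path in tag:
        parent = posixpath.dirname(path)
        while parent:
            dirs.add(parent)
            parent = posixpath.dirname(parent)
    # A directory qualifies when both sides hold the same files at the
    # same revisions below it.
    matching = {d for d in dirs
                if files_under(state, d) == files_under(tag, d)}
    # Drop directories already covered by a matching ancestor.
    return sorted(d for d in matching
                  if not any(d.startswith(m + "/") for m in matching))


state = {"src/a.c": "1.2", "src/b.c": "1.1", "doc/x.txt": "1.3"}
tag = {"src/a.c": "1.2", "src/b.c": "1.1", "doc/x.txt": "1.1"}
print(subtree_copies(state, tag))
```

Files outside the returned directories would still need file-by-file handling.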
> I haven't actually decided which algorithm I prefer.
> Initially, the second one would be made of simpler building
> blocks and therefore easier to debug and more bulletproof (like the
> current file-by-file is), but it didn't exactly turn out to be.
> I'd still prefer not to do any deletes when copying branches/tags.
> I'm also not exactly a fan of using dbm (text files would be more
> debuggable).
When it's a simple use, like tying to a DBM to get basically a
persistent hash, they're really no problem. They behave predictably
and (in my experience) don't interfere with development and debugging.
I'm not saying we absolutely must use this method, just that
debuggability shouldn't be a big concern if we do.
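To show what "basically a persistent hash" amounts to, here is a small sketch using the dbm module from the standard library of current Python. The record format (the two revision numbers joined by a space) and the function names are assumptions for illustration only. Storing values as plain text keeps the database easy to dump and inspect, which addresses part of the debuggability concern.

```python
import dbm


def store_range(db_path, filepath, n, m):
    """Persist filepath --> (N, M) in an on-disk hash."""
    with dbm.open(db_path, "c") as db:       # "c": create if missing
        db[filepath.encode("utf-8")] = f"{n} {m}".encode("ascii")


def load_range(db_path, filepath):
    """Read back the (N, M) pair stored for filepath."""
    with dbm.open(db_path, "r") as db:       # "r": read-only
        n, m = db[filepath.encode("utf-8")].decode("ascii").split()
        return int(n), int(m)
```

Usage would look like store_range("branch-foo.db", "src/main.c", 3, 9) followed later by load_range("branch-foo.db", "src/main.c"); the file name here is made up.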
> 1. If one adds a file on the trunk and then adds the file with the
> same name on the branch, the CVS revisions will be 1.1 for trunk, and
> 1.1.1.1 for branch. This looks like the file was branched, but
> actually it wasn't. (This needs to be added in the test suite).
It's the same 1.1.1.1 if you branch the existing file but don't commit
any changes on the branch, right? (In other words, the two cases
can't be distinguished.)
> 2. cvs import-ed repositories. When a cvs repository has been created
> with CVS import, all the tags on unmodified files will be created for
> revision 1.1.1.1, not revision 1.1. This is a major complication. My
> code currently has a hack to remap version 1.1.1.1 to 1.1 and skip
> vendor branch conversion, but this could be wrong if vendor branch was
> imported later. (I have a test repository where someone imported his
> own working copy... :)
Are you talking only about repeated vendor branch imports, or all
imports?
My understanding is that if a file is at 1.1 (i.e., its initial
state), and someone creates a branch on that revision, then the
branch's initial revision will be 1.1.1.1, before anything is
committed on the branch. Could be misremembering.
When you dig up (or write) patches, can you try to make them as
complete, self-contained, and small as possible :-)? For example,
maybe first let's add the separate `.commits' file, without any tag or
branch changes. It'll make it much easier for me to review the
patches in a timely manner, if I can grok exactly what each change
does.
I realize this makes the communication/code ratio go up, but I think
it's worth it. Hope you agree...
-K
Received on Mon Mar 10 22:26:11 2003