[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Merge tracking proposal

From: Daniel Berlin <dberlin_at_dberlin.org>
Date: 2006-04-28 18:34:04 CEST

Among other things I am working on at Google, I have been tasked
full-time with implementing merge tracking.

As part of this, I have come up with a design I plan on implementing for
tracking what revisions have been merged where, in a manner that is
suitable for use by various other operations (history sensitive merging,
etc).

In doing so, I reviewed the use cases that were kindly written up, and
believe that most if not all of them can be accomplished with this
design.

Please remember that this design is *only* for tracking what changes are
merged where. I expect this to be the easy part, compared to deciding
exactly what algorithms our history sensitive merge uses, and how it
proceeds.

I have divided the design into four portions "Goals", "information
storage", "information updating", "other prereqs to being able to
implement the design".

The "random questions and answers" section is there to answer common
questions other developers I've talked to while coming up with this
design have had, in the hopes that it will answer some common queries
the list may have.

Goals:

The overarching goal here is to track the revision numbers being merged
by a merge operation, and keeping this information in the right places
as various operations (copy, delete, add, etc) are performed.

The goals of this design are:
1. To be able to track this down to what files in a working copy and be
able to determine what files have had what revisions merged into them.

2. To not need to contact the server more than we already do now to
determine which revisions have been merged in a file or directory (ie
some contact is acceptable, asking the server about each file is not).

3. To be able to edit merge information in a human editable form.

4. For the information to be stored in a space efficient manner, and to
be able to determine the revisions merged into a given file/director in
a time efficient manner.

5. Still getting a conservatively correct answer (not worse than what we
have now) when no merge info is specified.

6. To be able to collect, transmit, and keep this information up to date
as much as possible on the client side.

7. To be able to index this information in the future order to answer
queries

Specific Non-goals for *this design* include:
1. Doing actual history sensitive merging
2. Curing cancer (aka being all things to all people)

When reading the design presented here, please remember that it is
impossible to get something perfect in subversion on the first try, and
attempting to nit pick this to death will not actually help anything,
but it would be very annoying. This is not to dissuade people from
suggesting design changes, but if you plan on suggesting a different
revision list format because you believe colon doesn't have a good level
of synergy with existing separators, or something, you may want to
rethink whether it really matters.

Some pre-notes:
The one argument i continually have with myself is whether to store info
in revprops, or just on dirs and files. If you want to try to
convincingly argue one way or the other, go for it. Certainly, I think
it makes certain semantics clearer on what operations do below and how
to proceed easier, the question is whether it is efficient enough time
wise when we go to retrieve merge info, and whether it complicates what
merge has to do too much. It also removes all of the listed
pre-reqs :).

One could also try to argue that we should start with exactly the same
cases svnmerge does (IE only allow merge info at the wc roots, only
store it on that directory, etc), with a nicer integrated interface, and
try to expand it from there. I am open to such an argument as well. :)

Anyway, on with the design.

Information storage

The first question that many people ask is "where should we store the
merge information" (what we store will be covered next).

After a large amount of research, the design I have come up with is
this:
A merge info property, named SVN_MERGE_PROPERTY (not the real name, I
have made it a constant so we can have a large bikeshed about what to
really call it) stored in the revision properties, directory properties,
and file properties.
Each will store the *full, complete* list of current merged in changes,
as far as it knows. This ensures that the merge algorithm and other
consumers do not have to walk back revisions in order to get the
transitive closure of the revision list.

The way we choose which of file, dir, revprop merge info to use in case
of conflicts simple system of inheritance[1] where the "most specific"
place wins. This means that if the property is set on a file, that
completely overrides the directory and revision level properties.

The way we choose which to store to depends on how much and where you
merge, and will be covered in the semantics.

The reasoning for this system is to avoid having to either copy info
everywhere, or crawl everywhere, in order to determine which revisions
have been applied. At the same time, we want to be space and time
efficient, so we can't just store the entire revision list everywhere.

As for what is stored:

For the large number of people i have talked to and heard about from
others, it seems the human editable *format* of how svnmerge stores
merge information (IE pathname and list of revisions) is fine. Binary
storage of such information would buy, on average, a 2-3 byte decrease
per revision/range in size over ascii[1], while making it not directly
human editable.

As such, i have chosen to represent the revisions we have merged *into*
something as a path, a colon, and then a comma separated revision list,
containing one or more revision or revision ranges. Revision range end
and beginning points are separated by "-".

So the grammar looks something like this

revisionrange -> REVISION "-" REVISION

revisioneelement -> revisionrange | REVISION

revisionlist -> (revisionrange | REVISION)(COMMA revisioneelement)*

revisionline -> PATHNAME COLON revisionlist

top -> revisionline (NEWLINE revisionline)*

This list will *not* be stored in a canonicalized minimal form for a
path (IE it may contain single revision numbers that could be ranges).
This is chiefly because the benefit of such a canonical format (slightly
easier *comparison*, but not indexing) is heavily outweighed by the fact
that generating a canonical form may require groveling through a lot of
information to determine what that minimal canonical form is. In
particular, it may be that the revision list "5,7,9" is, in minimal
canonical form, "5-9", because 6 and 8 do not have any affect on the
pathname that 5 and 9 are from.
Canonicalization could be done as a server side post pass because the
information is stored in properties.

Note that this revision format will not scale on its own if you have a
list of million revisions. None will easily. However, because it is
stored in properties, one can change the wc and fs backends to simply do
something different with this single property if they wanted to.
Given the rates of change of various very active repositories, this will
not be a problem we need to solve for many many years.

Information updating:
Each operation you can perform may update or copy the merge info
associated with a path, file, or revision.

svn add: No change to merge info
svn delete: No direct change to merge info (indirectly, because the
props go away, so does the merge info for the file)
svn rename: No change to merge info
svn copy: Copies the merge info from the source path to the destination
path, if any.

This includes copying info from revprops, if necessary, by determining
if the merge info exists in a revprop for the last changed commit for
the source path, and copying it to the new revprop if it does (someone
probably needs to check if this is the right semantic :P)

All copies are full-copies of the merge information.

svn merge: Adds or subtracts to the merge info, according to the
following:

Where to put the info:
1. If the merge target is a single file, the merge info goes to the
property SVN_MERGE_INFO set on that file.
2. If the merge target is a non-wc-root directory, the merge info goes
to the property SVN_MERGE_INFO set on the directory
3. If the merge target is a wc-root directory, the merge info goes to
the property SVN_MERGE_INFO set on the revprop.

What info is put:
1. If you are merging in reverse, revisions are subtracted from the
revision lines, but we never write out anti-revisions. Thus, if you
subtract all the merged revisions, you just get an empty list, and if
you do a reverse merge from there, you still get an empty list
2. If you are merging forward, the revision(s) you are merging is added
to the revision line in sorted order (such that all revisions and
revision ranges in the list are monotonically increasing from left to
right). The exact details of how the range is represented in terms of a
list of single revs, or a revision range, is left as a quality of
implementation detail. The only requirement is that the range be
correct.
3. The path (known as PATHNAME in the grammar) used as the key to
determine which revision line to change is the subdirectory path being
merged from, relative to the repo root, with the repo url stripped from
it.

Thus a merge of revisions 1-9 from http://foo.bar.com/reposroot/trunk
would produce "/trunk:1-9"

cross-repo merging is a bridge we can cross if we ever get there :).

pre-reqs for this design:

1. Need to be able to set a revprop to be stored on commit
2. Need to be able to say to copy a revprop from a particular revision
and only contact the server at commit time.

2. Need to be able to have auth treat SVN_MERGE_PROPERTY revprop
differently from other revprops (either by special casing the cases
users do care about controlling, or special casing props users don't
care about controlling, etc) so that people who don't have access to the
revprops can still do history sensitive merges of directories they do
have access to.

Random questions and answers

What happens if someone commits a merge with a non-merge tracking
client?
It simply means the next time you merge, you may receive conflicts that
you would have received if you were using a non-history-sensitive
client.

Can we do without the revprop portion of this design?
Technically yes, AFAIK, but it may require more crawling and querying at
merge time.

Can we do history sensitive wc<->wc merges without contacting the serve?
No. But you probably couldn't anyway, even if the revprop not being
stored locally issue were not here.

What happens if the info is not there?
The same thing that happens if the info is not there now.

What happens if a user edits merge info incorrectly?
They get the results specified by their merge info.

How does the revprop stay up to date?
We copy it from revision to revision.

What happens if a user manually edits a file and unmerges a revision (IE
not using a "reverse merge" command), but doesn't update the merge info
to match?
The merge info will believe the change has still been merged.

What happens if i svn move/rename a directory, and then merge it
somewhere?
This doesn't change history, only the future, thus we will simply add
the merge info for that directory as if it was a new directory. We will
not do something like attempt to modify all merge info to specify the
new directory, as that would be wrong.

I don't think only that copying info on svn copy is correct, what if you
copy a dir with merge info into a dir where the dir has merge info,
won't it get the info wrong now?

No.

Let's say you have

a/foo (merge info: /trunk:5-9
a/branches/bar (merge info: /trunk:1-4)

If you copy a/foo into a/branches/bar, we now have

a/branches/bar (merge info: /trunk:1-4)
a/branches/bar/foo (merge info: /trunk:5-9)

This is strictly correct. The only changes which have been merged into
a/branches/bar/foo, are still 5-9. The only changes which have been
merged into /branches/bar are 1-4. No merges have been performed by
your copy, only copies have been performed. If you perform a merge of
revisions 1-9 into bar, the results one would expect that the history
sensitive merge algorithm will skip revisions 5-9 for
a/branches/bar/foo, and skip revisions 1-4 for a/branches/bar.
The above information gives the algorithm the information necessary to
do this.

So if you want to argue svn copy has the wrong merge info semantics,
it's not because of the above, AFAIK :)

I'm sure that even in this long document, I've forgotten some things i
did spec out.
Apologies in advance.

Footnotes:
[1] This is not going to be a full blown design for property
inheritance, nor should this design depend on such a system being
implemented.

[2] Assuming 4 byte revision numbers, and repos with revisions numbering
in the hundreds of thousands. You could do slightly better by variable
length encoding of integers, but even that will generally be 4 bytes for
hundreds of thousands of revs. Thus, we have strings like "102341" vs 4
byte numbers, meaning you save about 2 bytes for a 4 byte integer.
Range lists in binary would need a distinguisher from single revisions,
adding a single bit to both (meaning you'd get 31 bit integers), and
thus, would require 8 bytes per range vs 12 bytes per range. While 30%
is normally nothing to sneeze at space wise, it's also not significantly
more efficient in time, as most of the time will not be spent parsing
revision lists, but doing something with them. The space efficiency
therefore does not seem to justify the cost you pay in not making them
easily editable.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Fri Apr 28 18:35:12 2006

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.