
[PROPOSAL] Database schema for storing merge tracking information

From: David James <djames_at_collab.net>
Date: 2006-05-03 02:33:55 CEST

Last week, Daniel Berlin posted a design proposal for tracking merges
in Subversion. Dan proposed keeping track of the list of revisions
merged between each pair of targets using properties stored on the
target file/directory. This proposal is now archived on our website
at http://subversion.tigris.org/merge-tracking/design.html

Quoting from design.html, "The current thinking is that the payload
of SVN_MERGE_PROPERTY will be stored in an index separate from the
FS which is created during svnadmin create. This index will support
fast querying, be populated during a merge or svnadmin load, and
cough up its contents during a svn propget SVN_MERGE_PROPERTY or
svnadmin dump. The contents of SVN_MERGE_PROPERTY will not be
stored redundantly in the FS (only in the index). Dan Berlin is
prototyping this index using sqlite3, and David James has a
(generic) schema design underway."

I'd like to propose a schema for this SVN_MERGE_PROPERTY index.
Please read my proposal below.

GOALS
=====

1) Efficient lookup: Finding out which revisions were merged to or
   from a specific path should be fast and simple.
2) Efficient copy: Copying one branch to a different location should
   be an O(1) operation in terms of the size of the branch and/or the
   number of revisions merged to it.
3) Versioning: All merge tracking information should be versioned. Old
   merge tracking information should never be deleted. Further, it
   should be possible to track and audit updates to the merge tracking
   information.

NON-GOALS
=========

1) No single-step recursive lookups: If you want to look up the merge
   tracking information for parents of a particular path, you'll need
   to look up this info for each path separately.
2) No lock-in to specific database libraries: I don't intend to pick
   the perfect database library or storage format. I'm describing a
   general schema, which is suited to a relational database, but could
   conceivably be implemented in any number of ways.

SCHEMA
======

Now, on to the fun part: the database schema. I've listed both
tables below, with the column names in CAPS.

HistoryForks: HISTORY_FORK_ID, MERGE_TARGET_PATH, SNAPSHOT_REV, KILL_REV
Merges:       HISTORY_FORK_ID, MERGE_SOURCE_PATH, ENTRY_REV, KILL_REV,
              COMMIT_REV, MERGE_START_REV, MERGE_END_REV

All columns are stored as integers, except MERGE_SOURCE_PATH and
MERGE_TARGET_PATH.
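
To make the layout concrete, here is a rough sqlite3 sketch of the two
tables. The exact types and syntax are illustrative only, not part of
the proposal:

    CREATE TABLE HistoryForks (
      history_fork_id    INTEGER NOT NULL,  -- id of this fork of history
      merge_target_path  TEXT    NOT NULL,  -- path whose merges this fork records
      snapshot_rev       INTEGER NOT NULL,  -- rev at which this path views the fork
      kill_rev           INTEGER NOT NULL   -- rev in which this row was invalidated
    );

    CREATE TABLE Merges (
      history_fork_id    INTEGER NOT NULL,  -- fork this merge belongs to
      merge_source_path  TEXT    NOT NULL,  -- path the revisions were merged from
      entry_rev          INTEGER NOT NULL,  -- rev in which this record was added
      kill_rev           INTEGER NOT NULL,  -- rev in which this record was invalidated
      commit_rev         INTEGER NOT NULL,  -- rev in which the merge was committed
      merge_start_rev    INTEGER NOT NULL,  -- first rev of the merged range
      merge_end_rev      INTEGER NOT NULL   -- last rev of the merged range
    );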

HistoryForks maps each MERGE_TARGET_PATH to a series of history forks,
each of which is denoted using a HISTORY_FORK_ID. We look at each
history fork as of SNAPSHOT_REV. If a path has no copies, it will
contain at most a single history fork. This single history fork,
known as the "active history fork", will have a SNAPSHOT_REV of
MAX_INT, so as to indicate that the history fork is still active,
and that future changes to that fork of history should continue to
affect this path. (We'll describe the complex cases later.) The
KILL_REV is also set to MAX_INT so as to indicate that this history
fork is still valid.
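
For example, the active history fork of a hypothetical branch
/branches/1.x might look like this (using 2147483647 as a stand-in for
MAX_INT):

    INSERT INTO HistoryForks
      (history_fork_id, merge_target_path, snapshot_rev, kill_rev)
    VALUES
      (1, '/branches/1.x', 2147483647, 2147483647);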

Merges maps each HISTORY_FORK_ID to a series of merges. Each merge
record is uniquely identified by the tuple described under Constraints
below, and contains a pair of revs (MERGE_START_REV, MERGE_END_REV) and
a MERGE_SOURCE_PATH. The range
MERGE_START_REV:MERGE_END_REV indicates which revisions were merged.
The range ENTRY_REV:KILL_REV indicates the range of revisions in which
this record was valid. COMMIT_REV is the rev in which the merge was
committed. ENTRY_REV is the rev in which this merge entry was added.
The KILL_REV is initially set to MAX_INT so as to indicate
that this merge is always valid.

If a copy is made of a particular path, we copy all of the history
forks with the old PATH from HistoryForks over to the new PATH.
This takes O(N) time in terms of the number of copies that have been
made of a particular path, but O(1) time in terms of the size of
the branch and the number of merges. The SNAPSHOT_REV for the
previously active fork of history is set to the rev that we are
copying from.
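
One way to read that step, sketched for a hypothetical copy of
/branches/1.x to /branches/1.x-copy made from r99:

    -- Give the new path a reference to every history fork of the old path.
    INSERT INTO HistoryForks
      (history_fork_id, merge_target_path, snapshot_rev, kill_rev)
    SELECT history_fork_id, '/branches/1.x-copy', snapshot_rev, kill_rev
      FROM HistoryForks
     WHERE merge_target_path = '/branches/1.x';

    -- Freeze the copy's view of the previously active fork at the rev we
    -- copied from, so later changes to the old branch's merge history do
    -- not leak into the copy.
    UPDATE HistoryForks
       SET snapshot_rev = 99
     WHERE merge_target_path = '/branches/1.x-copy'
       AND snapshot_rev = 2147483647;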

If a new merge is made to a particular path, we first check if that
path has an active history fork. If it does not, then a new history
fork is created. Once we have an active history fork, we add a new
merge record for that fork which starts its life at the current
revision and lasts until MAX_INT.
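
A sketch of recording a merge of r40:50 from a hypothetical /trunk into
/branches/1.x, committed in r120, assuming fork 1 is already the
branch's active history fork:

    -- (If /branches/1.x had no active fork, a new HistoryForks row with
    --  SNAPSHOT_REV = KILL_REV = MAX_INT would be inserted first.)
    INSERT INTO Merges
      (history_fork_id, merge_source_path, entry_rev, kill_rev,
       commit_rev, merge_start_rev, merge_end_rev)
    VALUES
      (1, '/trunk', 120, 2147483647, 120, 40, 50);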

If a set of revisions is rolled back from the active history fork, all
intersecting merges are killed by setting the KILL_REV for those
merges to the current revision. If a merge is partially rolled back,
then the entire merge is invalidated and new records are created in
the Merges table to show the portions of the merge that are still
valid.
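
Continuing the example above, suppose r45 of that r40:50 merge is rolled
back from /branches/1.x in r130. A rough sketch:

    -- Kill every live merge record on the active fork that intersects the
    -- rolled-back revision.
    UPDATE Merges
       SET kill_rev = 130
     WHERE history_fork_id = 1
       AND kill_rev = 2147483647
       AND merge_start_rev <= 45
       AND merge_end_rev   >= 45;

    -- Re-enter the portions of the partially rolled-back merge that are
    -- still valid.
    INSERT INTO Merges
      (history_fork_id, merge_source_path, entry_rev, kill_rev,
       commit_rev, merge_start_rev, merge_end_rev)
    VALUES (1, '/trunk', 130, 2147483647, 120, 40, 44);
    INSERT INTO Merges
      (history_fork_id, merge_source_path, entry_rev, kill_rev,
       commit_rev, merge_start_rev, merge_end_rev)
    VALUES (1, '/trunk', 130, 2147483647, 120, 46, 50);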

If revisions are rolled back from an inactive history fork, then we
set the KILL_REV for the inactive history fork and copy the merge
records from the old history fork into the active history fork. After
copying the merge records, we follow the regular procedure for rolling
back merges from the active history fork.
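
This case combines the two sketches above. For a hypothetical copy
/branches/1.x-copy whose view of fork 1 is frozen at r99, with fork 2 as
its active fork, a rollback in r140 might look roughly like:

    -- Kill the inactive fork for this path only.
    UPDATE HistoryForks
       SET kill_rev = 140
     WHERE merge_target_path = '/branches/1.x-copy'
       AND history_fork_id = 1;

    -- Copy the merge records this path could see on fork 1 (as of its
    -- snapshot, r99) onto its active fork, then apply the normal rollback
    -- procedure to fork 2.
    INSERT INTO Merges
      (history_fork_id, merge_source_path, entry_rev, kill_rev,
       commit_rev, merge_start_rev, merge_end_rev)
    SELECT 2, merge_source_path, 140, 2147483647,
           commit_rev, merge_start_rev, merge_end_rev
      FROM Merges
     WHERE history_fork_id = 1
       AND entry_rev <= 99
       AND kill_rev  >= 99;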

Constraints:
1. Each UNIQUE (HISTORY_FORK_ID, MERGE_TARGET_PATH) tuple refers to a
   fork of history whose merges affect a particular merge target path.
2. Each UNIQUE (HISTORY_FORK_ID, MERGE_SOURCE_PATH, COMMIT_REV,
   ENTRY_REV, MERGE_START_REV) tuple refers to a merge from
   MERGE_SOURCE_PATH to HISTORY_FORK_ID in COMMIT_REV. This merge
   record was created in ENTRY_REV.
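
In sqlite3 terms, these constraints could be expressed as unique
indexes; again, just a sketch:

    CREATE UNIQUE INDEX history_forks_key
        ON HistoryForks (history_fork_id, merge_target_path);

    CREATE UNIQUE INDEX merges_key
        ON Merges (history_fork_id, merge_source_path, commit_rev,
                   entry_rev, merge_start_rev);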

FAQ
===

1. Why do we need the HistoryForks table? Instead, we could just
   associate merges directly with paths, saving a level of indirection.

The big advantage of the "HistoryForks" table: We can copy branches
without making a copy of their associated merge-tracking information.
In the vast majority of cases, this historical merge-tracking
information is never modified, and therefore we can simply refer back
to the old history in any copies of the branch.

Copying merge-tracking information doesn't just waste space; it also
wastes time. By doing less copying, we don't just save time up-front,
but we also save time when we need to scan through all this
information to find the relevant bits.

In some cases, we do need to modify historical information; for
example, if you copy a branch, and then roll back a revision that was
merged on the original branch, we need to mark the old revision as
rolled back. To do this, we invalidate our old history fork, and copy
the old merge tracking information over to the new history fork.

2. What's the purpose of SNAPSHOT_REV?

When a branch is copied, we use SNAPSHOT_REV to refer back to a
'snapshot' of the merge history at the time of the copy. If the merge
history of the old branch is modified after this point, such changes
should not be propagated to the new branch. We avoid propagating the
changes to the merge tracking history by asking all copies of the
branch to look at the merge tracking history as of SNAPSHOT_REV.
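
For example, a lookup of the merges affecting a path (goal 1) might join
the two tables and filter on each fork's snapshot, roughly as follows;
the boundary conventions here are an assumption, not part of the
proposal:

    -- Which revisions of which sources have been merged into this path?
    SELECT m.merge_source_path, m.merge_start_rev, m.merge_end_rev
      FROM HistoryForks hf
      JOIN Merges m ON m.history_fork_id = hf.history_fork_id
     WHERE hf.merge_target_path = '/branches/1.x-copy'  -- hypothetical path
       AND hf.kill_rev = 2147483647                     -- fork still valid
       AND m.entry_rev <= hf.snapshot_rev               -- visible as of the
       AND m.kill_rev  >= hf.snapshot_rev;              -- fork's snapshot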

3. When can merge-tracking history be canonicalized?

We can canonicalize merges at any time by adjusting MERGE_START_REV
and MERGE_END_REV of specific merge records to account for "phantom
revisions", or revisions which were not merged but can be treated as
merged because they do not make any changes to the source branch. Such
a canonicalization step could simply modify existing records and would
not require any new entries. One advantage of this form of
canonicalization is that any canonicalization performed on one branch
would automatically be propagated to any copies which reference this
historical information.
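
For instance, if a branch has a live record for a merge of r60:70 from
/trunk, and a hypothetical r71 makes no changes to /trunk, a
canonicalization pass could simply widen the record in place:

    -- Extend the r60:70 record to absorb the phantom revision r71.
    UPDATE Merges
       SET merge_end_rev = 71
     WHERE history_fork_id = 1
       AND merge_source_path = '/trunk'
       AND merge_start_rev = 60
       AND merge_end_rev = 70
       AND kill_rev = 2147483647;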

4. What's the difference between COMMIT_REV and ENTRY_REV?

COMMIT_REV is the revision in which a particular merge was committed.
ENTRY_REV is the revision in which the info about the merge was added.
COMMIT_REV may differ from ENTRY_REV if merge information is edited
after the commit.

5. Why do you refer to path strings directly from your merge records?
   If many merges affect the same path, you might store the same path
   several times in the DB. You could save space by storing all of the
   paths in a separate table, and referring to them by number.

It's simpler to refer directly to paths instead of path ids. If we
created a separate table to store these paths, we'd need an extra
lookup to convert path ids into real paths. I don't know whether
this extra lookup would affect performance, but I'd rather stick
with the simpler approach until we run some benchmarks.

That's all folks!

Cheers,

David

--
David James -- http://www.cs.toronto.edu/~james