Re: Finding other descendants - efficiently

From: Michael Sinz <Michael.Sinz_at_sinz.org>
Date: 2007-04-04 00:34:02 CEST

C. Michael Pilato wrote:
> Michael Sinz wrote:
>> The question of finding the copy sources (see "[RFC] Identifying
>> copy/move sources" email thread) rekindled a "need" to do the inverse
>> - finding where a file/revision was branched/copied to.
>>
>> That is, it would be great to find where a specific revision of a file
>> was copied to such that I could, for example, track down all branches
>> that may need a specific fix or other "genealogy" type attributes from
>> the trunk or source file. (One that I was hoping to provide is a way
>> to find the "branches" that could be merged back to this part of
>> trunk, but that is being solved in our development process and naming
>> conventions)
>
> This is a very common request from CollabNet's Subversion-using customers,
> and a key part of the auditability aspect of version control.

Yes - which is an internal issue at some of my clients too...

I am glad that I am not the only one to be thinking about this. At least
it means that I am not alone in my pain (or strange requests).

>> Right now, this process is hard for a few reasons:
>>
>> 1) Using normal interfaces you can only really find the source of a
>> copy, not the destination(s).
>
> "normal interfaces" ?

Well, without building something on top of it with, say, sqlite or some
other information tracking system. :-) Maybe I should have said "built-
in interfaces" ?

>> 2) Even if you get the whole repository log (including the path
>> details which is needed to find the copies) it does not always
>> indicate the file but just the directory. This complicates things,
>> especially when the tag/branch came from a mixed revision working
>> copy.
>
> Complicates, yes, but I think the data integrity is still intact. What's
> *not* there that really, really ought to be, though, is a way to distinguish
> files from directories in the log changed-paths list. :-(

Yes, that would be nice - just a simple trailing "/" would be enough :-)
But, given some of the other complications, this may not be the major item.

>> 3) Even if we can solve the #2 complexity, there is the fact that
>> getting the whole log from a large repository with many revisions is
>> very, very costly.
>
> Yup. (Though softened if you implement a client-side cache of
> already-parsed revisions.)

Yes, softened. Depends on what one caches and how many copies are made.
If you cache just the copy results, you can significantly reduce some of
the overall data transmission costs. In fact, with some careful processing
this could be a reasonable "incremental" process...

>> Now, I understand that this may not be easily done. In fact, given
>> how the FS works, there is little in the way of any forward linkages
>> (hey, it is a DAG, so why would there be). Also, the fact that past
>> revisions do not get modified is a major win for a number of reasons,
>> so where would such information be stored. (I have thought about
>> doing this in a post-commit hook and storing some path/revision texts
>> within a revprop but...)
>>
>> Does anyone have a better idea of what could be done? Maybe the
>> performance and size of the log issue will get greatly reduced by the
>> --only-copies capability. That may make generating the genealogy data
>> a bit easier for larger systems. (Maybe that is really the only
>> reasonable answer at this time?)
>
> The algorithm for answering the question, "Where live all the files whose
> history includes PATH@REV?" involves finding all copies made from PATH
> between REV and HEAD, plus those made from any of PATH's parent directories
> between those same revisions, plus copies of those copies, copies of the
> copies of those copies, etc., until all successors have been checked.

This is where the search gets annoying - I had almost forgotten about some of
the nasty cases of parent-parent copies that change rooting in the next stage.
(Mainly because our normal case usage does not have that but some of my test
cases did, mainly due to code reorganizations, which makes this really "fun".

> I
> think you might get to rule out soft-copies (since there should be some real
> copy that covers that case elsewhere in the crawl). And having successor
> links in the FS DAG (just like we have predecessor links) would certainly
> help alot. Being able to look at a node and say, "where are your
> descendents" and getting a list that includes the next version in that line
> of history plus all the nodes made as copies of the one in question helps.
> But I think you're still looking at a sort of brute force crawl.

Given the current structure, that is true. I am trying to see if I can somehow
get some generational tracking into our system (or maybe, once I get a good
idea of how, propose some solution for Subversion)

-- 
Michael Sinz                     Technology and Engineering Director/Consultant
"Starting Startups"                                mailto:michael.sinz@sinz.org
My place on the web                            http://www.sinz.org/Michael.Sinz
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Received on Wed Apr 4 00:34:44 2007

This message: [ Message body ]
Next message: Michael Sinz: "Re: [RFC] Identifying copy/move sources"
Previous message: Max Bowsher: "Re: "SVNAuthorizationShortCircuit or something similar""
In reply to: C. Michael Pilato: "Re: Finding other descendants - efficiently"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]