[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: svn_ra_get_file_revs2 vs. blame vs. FS API

From: Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com>
Date: Wed, 19 Feb 2014 18:50:24 +0000

On Mon, Feb 17, 2014 at 12:28 PM, Julian Foad <julianfoad_at_btopenworld.com>wrote:

> Stefan Fuhrmann wrote:
> > r1568600 uncovered an inconsistency in our API usage / interpretation
> > making blame -g tests fail for FSX.
> >
> > The starting point is svn_fs_contents_changed and svn_fs_props_changed.
> > FSFS and probably BDB implement those as "has there been an intermittent
> > change?" E.g. if the result of some merge yields equal content on both
> sides,
> > they will still be considered "changed". That is a violation of the API
> definition
> > basically asking "is the delta empty?". Even the FSX code is not fully
> consistent
> > with that, yet.
>
> BDB and FSFS and FSX all implement the comparison in exactly the same way:
> "does it have the same rep-key?".

Well, not exactly. FSFS adds a "unifier" that tells different
instances of a shared representation apart (but uses the
same unifier when the rep does not change between
successive noderevs).

> This does not mean "is the content identical?", but it does not
> necessarily mean "has there been an intervening change to the content?"
> either.

Well for FSFS, it does the latter.

> It only means "was the same representation instance chosen to represent
> this content?". For example, I think it might be possible for this to
> return "different" even if there has been no intervening change to the
> content, if the rep cache was disabled and re-enabled during a series of
> commits, although that depends on even more implementation details.
>
> One thing we can say for certain is that, despite the doc string, these
> APIs have always been implemented to return "the same" in some cases when
> the content is known to be the same, and return "different" in cases where
> the content might be different but might not be.
>
> Some of the callers use the call to optimize whether to fetch and compare
> the content. That's fine of course.
>
> Some other callers expose this behaviour to higher level APIs, where the
> caller might be expecting a definite answer to "Does the content differ?",
> and that's where the problems begin.
>
> Both semantics are valid, of course, so we might end up wanting to provide
> both.
>

I agree. We might simply a allow_false_positives flag in
svn_fs_*_changed2. Because we already have MD5 and
often SHA1 checksums as part of the representation_t, we
don't need to do expensive full comparisons. With one
exception: when comparing props and one of them is in
a transaction, we don't know the checksums, yet.

I already prepared a patch for FSX that provides the "strict"
semantics with very little overhead.

> svn_ra_get_file_revs2 uses svn_fs_contents_changed to decide whether
> > a delta should be sent to the given handler. The false positives
> returned by
> > FSFS and BDB are benign since we simply end up sending an empty delta
> > (a mere waste of CPU cycles).
>
> And note that the doc string for svn_ra_get_file_revs2() explicitly says
> these false positives can be returned.
>

The point is that the blame implementation relies on those
false positives. I'd prefer to make false positives a rare instance.

> > However, our blame implementation relies on those false positives because
> > it tracks contents from revisions to be merged later separately from
> "mainline"
> > changes. When the merge finally happens but results in the same contents
> > as the previous branch revision, the delta handler must be called to
> pull in
> > the blame info from the branch. The latter no longer happens with the
> strict
> > interpretation of svn_fs_contents_changed.
>
> Without following the details, I think what you mean about merges is that
> the blame algorithm is trying to report an event whenever a merge happened,
> even if that merge did not result in a content change. The gist of it seems
> to be that the blame implementation assumes:
>
> * that a new content rep-key is assigned whenever a merge is committed,
> even if the content is unchanged, but not at other times when the content
> is unchanged; and
>

Yep, that is the implementation detail leak that I'm concerned about.

> * that svn_ra_get_file_revs2() will notify if and only if a new rep-key
> is assigned.
>

No. False positives are fine. Always reporting an (empty) delta should
just work. In the blame case, that is.

> or something very much like that. Those assumptions look like the real
> problem here.
>
>
> > I see the following options:
> >
> > * Redefine svn_fs_contents_changed and svn_fs_props_changed to match
> > the current implementation. I'd rather not but that just a -0 from my
> side.
>
> We should deprecate them, and adjust their doc strings with a historical
> note about the discrepancy, and replace them with one or two new (pairs of)
> APIs:
>
> -- a quick optimization API -- the definitely/maybe question -- like the
> existing implementations but documented properly and named appropriately;
> and
>
> -- (perhaps) a definite content comparison API: "has the content
> changed?"
>

I tend to prefer an extra flag parameter in a bumped version of the two
API functions as it makes existing API users aware of how the API has
been changed.

> * Make svn_ra_get_file_revs2 always accept a delta handler unless the
> > previous change was on the same branch and the delta is empty (e.g.
> > just a prop change). Sounds arbitrary, so -0 on that too.
>
> I don't understand this one at all. Do you mean "call the delta handler
> unless..."?

That is the tricky part of understanding svn_ra_get_file_revs2. Here is
how it works:

* find all "relevant" revisions (and paths, particularly with blame -g)
* send all (revision, path) pairs
  - has contents changed?
  - call handler; if contents changed set delta_handler and delta_baton
    _output_ parameters; NULL otherwise
  - handler may set *delta_handler and *delta_baton, if caller accepts
    them and handler is interested in the delta
  - (outside handler) if delta handler & baton provided, drive delta

What does "same branch" mean in Subversion?

svn_ra_get_file_revs2 also knows whether a given revision was merged
or already "interesting" for the original path. The latter is called
"mainline"
in the blame code and I used the term "branch" loosely for "same path".
Consecutive empty changes to the same path cannot be relevant to blame
while empty change between two paths may exactly be the result of a merge.

Why would a prop change be considered an empty delta?
>

Blame does not process prop changes. However, prop changes are
somewhat frequent and would (often) result in an empty file contents
delta. So, there is a point in pre-checking for equal reps at all.

> * Make svn_ra_get_file_revs2 accept delta handlers (roughly) as today
> > but use the log information for that, i.e. whenever the content
> modified
> > flag is set for the respective file path. That would be sound approach
> > since we are in rev_hunt.c anyway. The info may already be available
> > (haven't checked yet).
>
> Not sure what you mean there, exactly.
>

The "accept" bit: see above. The "modified" flag is part of the
changed paths lists we get from log info. It is basically what
FSFS tries to indicate with its false positives in *_changed:
Data has been written to this node in that rev - whether it
ended up as a true change or not.

> > * Make the blame implementation update the mainline blame info even
> > if no new delta info came in but the previous callback wasn't for the
> > mainline.
>
> Sorry, I can't follow that one either without digging into the blame code.
>

Well, according to Bert, the blame code is only an approximation
in the blame -g case. The critical sequence is:

handle-blame(r2, "/some/branch", some-delta)
handle-blame(r3, "mainline", no change) // was a merge from "/some/branch"

Today, FSFS reports a false positive different between r3 and r2
which also the blame delta handler to pull in the non-mainline
blame info into the mainline blame data (even as part of an
otherwise empty delta).

Clearly the blame implementation needs to be changed, but if we can start
> by determining what question it needs to know the answer to and define
> suitable APIs for answering it, that would be a good way to proceed.
>

Bumping that svn_fs_*_changed API would certainly be a good
starting point. Since I'm currently on "working vacation", I may find
some time to make the blame code work with the "strict" change
semantics as well. That would fix the FSX release tests.

As for svn_ra_get_file_revs2, I feel that it needs a better docstring.
We might want to bump it in the same way as the svn_fs_*_changed
API in case someone relied on undefined behaviour just like we
did with blame.

-- Stefan^2.
Received on 2014-02-19 19:50:58 CET

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.