Re: Why do we check the base checksum so often?

From: Hyrum K Wright <hyrum.wright_at_wandisco.com>
Date: Sat, 4 Feb 2012 12:10:01 -0600

On Sat, Feb 4, 2012 at 10:59 AM, Julian Foad <julianfoad_at_btopenworld.com> wrote:
> Hyrum K Wright wrote:
>
>> Julian Foad wrote:
>>> Hyrum K Wright wrote:
>>>> The Ev2 shims get in the way of how text deltas are transmitted, by
>>>> reconstituting the full text, and then just streaming that to the
>>>> receiver via svn_txdelta_send_stream(). I've got a patch which
>>>> actually starts reporting the base checksum---which with the shims
>>>> will always be the "empty" checksum---and it turns out that
>>>> such a patch breaks the World.
>>>>
>>>> The reason for this breakage is that there are several places in both
>>>> the FS and the WC that we check the delta editor's reported base
>>>> checksum against some other value we have on hand which we *think*
>>>> should be the base. Until now, these checks have always passed, since
>>>> there was an implicit understanding about what the delta editor would
>>>> use as its base.
>>>>
>>>> However, I think that these checks are wrong. They rely upon an
>>>> implementation detail ("is the delta editor sending a text delta
>>>> against the base we think it ought to?") rather than the result ("did
>>>> we end up with the content we expected to end up with?")
>>>
>>> When we (the WC update code for example) receive a text delta, we apply it
>>> to a text base that we already have, in order to create a new text. We
>>> need to be applying it against the correct base [...]
>>
>> I understand this principle, but I don't think that's what the API
>> is/should be doing. The apply_textdelta callback is essentially
>> saying "apply this delta against the base with this checksum". In the
>> current regime, we know a priori what that base "should" be, so we
>> make sure that apply_textdelta spits that information back to us.
>>
>> But I don't think that's always a valid assumption. If the delta
>> editor chose some other base to use (in this case, the empty stream),
>> and indicated that through the apply_textdelta() base checksum
>> parameter, a receiver should be happy to accommodate that request.
>> "Why should I use the base you told me to use, when I can use this one
>> more efficiently?"
>
> We're talking here about the delta editor (Ev1). The driver shouldn't have free rein to choose any base, because the receiver does not have all possible bases at hand ready to apply the delta onto. At least in the server-to-client direction (update etc.) the client probably only has one suitable base text per possible file.

This statement is false. The server always has *two* potential delta
bases to choose from, the empty stream being one of them, as you
mention below.

> Either the server would have to be told what base texts it could choose from, or the client would potentially not be able to apply the delta until it first asks the server to send it the relevant base text, which would pretty much negate the point of having deltified in the first place. In the other direction, of course, we can now start to design protocols where the client picks any base text that it knows exists in the repository, and the server could be able to access it, now we have the rep-cache and the idea of looking up texts by their checksum. But ... that can't be what you're thinking of, I'm sure.

I'm thinking of a much simpler scenario: if the client doesn't have
the required base, it simply errors out. "I told you to use base X,
you decided to use base Y. Since I don't have base Y, I'm going to
return an error to let you know that."
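
To make that concrete, here is a rough sketch of the receiver-side
behaviour I have in mind. It is illustrative Python, not Subversion
code; MissingBaseError, md5_hex() and select_base() are hypothetical
names, and the dict of available bases stands in for whatever texts the
receiver actually has on hand.

```python
import hashlib


class MissingBaseError(Exception):
    """Raised when the driver names a base the receiver does not hold."""


def md5_hex(data):
    # Hex MD5 digest of a byte string, used here as the base identifier.
    return hashlib.md5(data).hexdigest()


def select_base(available_bases, requested_checksum):
    # available_bases maps checksum -> base text the receiver has on hand.
    # If the driver picked a base we do not have, report an error instead
    # of silently applying the delta to the wrong text.
    try:
        return available_bases[requested_checksum]
    except KeyError:
        raise MissingBaseError(
            "no base with checksum %s available" % requested_checksum
        ) from None
```

The receiver keeps applying deltas exactly as before when the base is
one it holds, and fails loudly otherwise.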

> The empty stream is a special case. It's a valid suggestion to say the driver should have the option of sending a full text, or a delta against an empty stream which is semantically the same thing. But retro-fitting that onto Ev1 isn't interesting at this point.

Oh, I don't know about that. All this base checksum checking is
already conditional on there even being a base checksum supplied by
apply_textdelta(). We could just as easily ignore the base checksum
if it were for the empty stream as well.
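
Roughly, the check would grow one extra early return. This is a sketch
in Python rather than the actual C checks in the FS and WC;
verify_base_checksum() and ChecksumMismatch are hypothetical names, and
I'm assuming MD5 hex digests as the checksum form.

```python
import hashlib

# MD5 of zero bytes: the checksum of the empty stream.
EMPTY_STREAM_MD5 = hashlib.md5(b"").hexdigest()


class ChecksumMismatch(Exception):
    """The driver's reported base differs from the base we expected."""


def verify_base_checksum(reported, expected):
    # No checksum supplied: today's behaviour is to skip the check.
    if reported is None:
        return
    # Proposed special case: a delta against the empty stream is just a
    # full text, so it never depends on the base we hold.
    if reported == EMPTY_STREAM_MD5:
        return
    if reported != expected:
        raise ChecksumMismatch(
            "expected base %s, driver reported %s" % (expected, reported)
        )
```

Any other unexpected base is still rejected, so only the empty stream
gets special treatment.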

> Now, if we talk about Ev2 (I know you're actually looking at the shims between the two), then we've explicitly designed that the mechanism for transferring texts is outside the scope of the editor itself and so the driver and receiver code are responsible (assisted by respective layers above them) for co-ordinating in any way they want to. The Ev2 solution for deltifying text between driver and receiver could include (warning: possible hare-brained ideas): the receiver telling the driver what base texts it has available; the driver first choosing a base that's convenient for it, and letting the receiver request that base from the driver (out of band) if the receiver doesn't have it available; and so on.

Implementation details. We can worry about the underlying
deltification schemes of the various transport layers when we get to
them.

> I'm not quite sure I fully follow you at the moment, so I'm not sure if my reply is on the right track at all, but it's really sounding like you're up against a mis-match of responsibilities between Ev1 which sends deltas according to particular rules and Ev2 which is designed to be wrapped inside a driver-receiver pairing that knows privately how to deltify and recover to full text in any way it wants to. The shims obviously need to convert from the Ev2 deltification back (via a full text intermediary if necessary) to what Ev1 expects.

What's driving this discussion is this: Up until this point in the
Ev2 shims we've been supplying a NULL base checksum to
apply_textdelta(), which the receivers have dutifully ignored. However, when
the Ev2 shims attempt to be honest about the fact that they are using
the empty stream for the text base, the receivers start complaining,
because that's not what they expected---even though the end result is
the same. In essence, all these checks are returning false positives,
which is extremely unpleasant.

I don't know that there is an easy way around this, since by the time
we've translated from Ev2->delta-editor, we don't have the original
text base, or its checksum, available to us. We have the full text,
which is the reason the new text base is the empty stream: it's the
only one we need.
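
A toy model of why the end result is the same either way. This is not
svndiff or the real txdelta code: the ("copy", offset, length) and
("new", data) ops are a simplified stand-in I'm assuming for
illustration, and apply_delta() and fulltext_as_delta() are
hypothetical helpers.

```python
import hashlib


def apply_delta(base, ops):
    # Rebuild the target text from a base and a list of delta ops.
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            if offset + length > len(base):
                raise ValueError("copy op reads past the end of the base")
            out += base[offset:offset + length]
        elif op[0] == "new":
            out += op[1]
        else:
            raise ValueError("unknown op: %r" % (op[0],))
    return bytes(out)


def fulltext_as_delta(fulltext):
    # What the shims effectively do: one "new data" op, no copies, so
    # the only base this delta can need is the empty stream.
    return [("new", fulltext)]


old_text = b"hello world\n"
new_text = b"hello there, world\n"

# A conventional delta against the old text base.
real_delta = [("copy", 0, 6), ("new", b"there, "), ("copy", 6, 6)]

via_real_base = apply_delta(old_text, real_delta)
via_empty_base = apply_delta(b"", fulltext_as_delta(new_text))

# Checking the *result* checksum passes for both routes, even though
# the two deltas were computed against different bases.
result_md5 = hashlib.md5(new_text).hexdigest()
```

Both routes produce new_text, which is why a check on the result
checksum is satisfied while a check on the base checksum is not.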

Does that make any sense?

-Hyrum

PS - In response to Bert's comment about MD5 uniquely identifying
bases, I would agree. Though I think special casing for the empty
stream, rather than arbitrary potential bases, is still reasonable.

-- 
uberSVN: Apache Subversion Made Easy
http://www.uberSVN.com/

Received on 2012-02-04 19:10:35 CET
