[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: svn_fs_fs__prop_rep_equal: non-strict mode, consistency, and MD5

From: Julian Foad <julianfoad_at_gmail.com>
Date: Sun, 27 Sep 2015 09:04:27 +0100

A few words on the motivation here, and (further down) a follow-up
regarding the correctness of the patch I proposed.

I was aware when Stefan was changing this area of the code, and so I
had the opportunity to review at the time, but I've only now started
paying so much attention to this.

I am looking into the behaviour change of
svn_fs_{props,contents}_changed(), as part of the 'dump of no-op
changes' issue. I expect we'll want to resolve that issue by restoring
the original (1.8) behaviour, but first I want to discover exactly
what that 1.8 behaviour is. I rather expect there will be
inconsistencies in it. If so, we'll have to choose whether to try to
restore exact (bug-for-bug) compatibility with 1.8 or just
approximately the same behaviour.

I was not expecting to find inconsistencies in the 1.9 behaviour of
those APIs. Although we decided (and documented) that the those APIs
are promising only to do a 'best effort quick check', I now think it
would be much better to keep their behaviours as solid and consistent
as possible. (Consistency between text and props, between degrees of
certainty, between reps in revs and in txns, and so on.)

Most of this thread concerns the '_changed' behaviour. If we're going
to resolve that 'dump' issue by reverting the '_changed' behaviour,
then some of what we're discussing here will become moot. I still
think it's worth discussing first, at least so I understand it. We
don't have to commit any patch just yet.

Behaviour of the 'svn_fs_*_different' APIs ought to be
uncontroversial. One point, however, affects these: the issue of
reliance on MD5 equality.

(more below...)

On 25 September 2015 at 19:01, Julian Foad <julianfoad_at_gmail.com> wrote:
> (New thread, taken from an observation in the "No-op changes no longer
> dumped..." thread. In that thread, I hadn't spotted that it is only
> the props comparison.)
>
> Starting from 1.9.0, the FS API content comparison methods
>
> svn_fs_contents_changed()
> svn_fs_contents_different()
> svn_fs_props_changed()
> svn_fs_props_different()
>
> are implemented in FSFS by svn_fs_fs__dag_things_different() which calls
>
> svn_fs_fs__prop_rep_equal(strict=TRUE/FALSE)
> and/or
> svn_fs_fs__file_text_rep_equal(strict=TRUE/FALSE)
>
> * svn_fs_fs__file_text_rep_equal() uses MD5 checksum as a quick check,
> returns 'not equal' if MD5s do not match, else gets a definitive
> answer by comparing SHA1s (if present) or full text [1]. That's fine.
>
> * svn_fs_fs__prop_rep_equal(), by contrast, reports that two
> properties-reps are equal if their MD5 checksums are equal. [2]
>
> These functions were introduced in
> http://svn.apache.org/viewvc?view=revision&revision=1572336. The log
> message indicates relying on MD5 equality for properties was
> intentional, but is it consistent with the general guarantees we want
> from Subversion? I think we generally want to rely on at least SHA1
> equality, and I think we should apply that rule for all data.
>
> If we change this to return early only if MD5s differ, else fall
> through to a full check if MD5s match, the single run-time cost of
> falling through is not severe (and is a hit that we take anyway
> whenever one set of props is in a txn rather than a revision), and the
> frequency of hitting this code path would be near zero -- only cases
> where there is an MD5 collision. So I think not relying on MD5
> equality here is a definite improvement.

Oops, my think-o: When I said "near zero frequency", of course that's
only if the properties are in fact different. In practice this case is
probably hit very frequently because we're comparing properties that
are identical.

> In terms of severity, I think we should backport "don't rely on MD5
> equality" to 1.9.x, but with no special urgency.
>
> FSFS and FSX are affected equally; BDB is unaffected, as its code in
> txn_body_props_changed() doesn't use the MD5 shortcut.
>
> On looking further, I found more concerns with
> svn_fs_fs__prop_rep_equal(). In non-strict mode it behaves quite
> differently for (rev:rev) comparison than for a (rev:txn) or (txn:txn)
> comparison. I don't like this; I think that is likely to lead to
> problems.
>
> The four comparison methods in the FS API are tested by
> tests/libsvn_fs/fs-test[.c] 48.
>
> * This test only tests (rev:rev) comparisons.
> * This test looks includes a case where text and a property are set
> to two versions of 'evil' text thay yield the same MD5 sum. I
> initially thought this case would ensure test coverage for the unusual
> (MD5 collision) case for all functions under test, but it does not.
> The 'evil' text forms only a substring of the property-list rep, and
> the MD5 sums of the full reps no longer collide.
>
> Also it's confusing because the test sub-case 'evil text' (i=3)
> doesn't actually end up executing the code path for finding identical
> MD5s of the reps, because the special text is just the value of one
> prop and not the complete rep of the properties list.
>
> I propose the attached patch, for a start. It doesn't do anything to
> resolve the change (regression) of the exact behaviour of non-strict
> mode comparisons, which is the subject of another thread. It only
> addresses the issues listed in this email.

Is it true that properties reps are stored in a canonical form, in all
FSFS formats that this code path might be reading?

The patch I proposed assumes that properties reps are stored in a
canonical form, so that finding a difference as opaque binary blobs is
sufficient to imply they represent different set of properties. If
that is not true, then of course we need to modify the patch to fall
back to parsing the reps and doing a property-hash comparison if the
quick checks indicate inequality.

If so, why was the current code parsing the props to compare them; and
if not, why was it relying on MD5? Is is because they're stored
canonically in one case but not in the other case? Need code comments,
please, to explain this if that's so.

If there is a good reason to have dedicated code for props-rep than
for text-rep comparison, then can we please still define them with
equivalent behaviour?

- Julian
Received on 2015-09-27 10:04:57 CEST

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.