On Tue, Apr 14, 2015 at 3:28 AM, Branko Čibej <brane_at_wandisco.com> wrote:
> On 13.04.2015 18:04, Stefan Fuhrmann wrote:
> > On Mon, Apr 13, 2015 at 11:20 PM, Daniel Shahaf
> > <d.s_at_daniel.shahaf.name <mailto:d.s_at_daniel.shahaf.name>> wrote:
> >
> > rhuijben_at_apache.org <mailto:rhuijben_at_apache.org> wrote on Mon, Apr
> > 13, 2015 at 14:28:39 -0000:
> > > - *has_props = (noderev->prop_rep->expanded_size > 4);
> > > + *has_props = (noderev->prop_rep->expanded_size > 4
> > > + || (noderev->prop_rep->expanded_size == 0
> > > + && noderev->prop_rep->size > 4));
> >
> > Having to remember, on every use of EXPANDED_SIZE, to check if
> > it's zero
> > is madness.
> >
> >
> > The pain has been increasing and is now at a level
> > where we should do something about it.
> >
> >
> > Is there a good reason not to make the deserializer
> > function handle this? If needed, by inventing a getter or a new
> > struct
> > member that doesn't have this quirk?
> >
> >
> > The main problem is that we don't know whether SIZE is
> > correct, either, without looking at the actual representation:
> > It could be a self-delta of an empty fulltext. So, most of
> > these tests are contextual, we e.g. know that these are
> > property reps. Or we can allow for false positives or false
> > negatives etc.
> >
> > One way to get around this problem is to basically use
> > the approach of svn_fs_fs__file_length(), where we compare
> > the contents checksum with with that of an empty rep.
> > We might also restrict that check to "meaningful" values
> > of "SIZE", i.e. those txdelta streams that produce an
> > empty output.
> >
> > Assuming we find a truly reliable test (e.g. say txdelta is
> > always 4 bytes and we can enumerate all MD5s), we could
> > add it to the parser function in low_level.c.
> >
> > Thoughts?
>
> I believe that currently a txdelta representation will always be at
> least 4 bytes long; it's represented in svndiff format, which has a
> 4-byte magic header. Unless something has changed since I wrote that
> part of the code years ago, we'll discard delta representations that are
> larger than the fulltext, which implies that empty reps will always be
> stored as fulltext.
>
That should still be the case. So, repositories written
exclusively by SVN will be easy to check. For others,
there is a fallback (see below).
> I'd hate to rely on such heuristics in general, but for this specific
> case, if they turn out to be valid ... and as long as we really
> carefully document and cross-reference the assumptions ... and so on.
>
I wrote a simple test to verify (in less than 12mins) that
no 4 byte sequence produces the same MD5 as the
empty content. So, we can resolve EXPANDED_SIZE
immediately after parsing like this:
* EXPANDED_SIZE > 0 -> Done.
* MD5 != MD5(empty) -> EXPANDED_SIZE := SIZE. Done.
* SIZE == 4 -> Done.
At that point, the rep may be either a non-SVN-generated
txdelta - in which case EXPANDED_SIZE must already
be correct - or is a PLAIN rep with MD5 hash collision.
Both are extremely unlikely but can be solved by reading
the representation header. Then both cases can be told
apart:
* PLAIN rep -> EXPANDED_SIZE := SIZE.
* Done.
-- Stefan^2.
[[[
static svn_error_t *
all_32_bit_md5(apr_pool_t *pool)
{
apr_pool_t *iterpool = svn_pool_create(pool);
apr_uint64_t i;
svn_checksum_t *checksum;
/* Be sure we get the the correct MD5 for an empty input. */
SVN_ERR(svn_checksum(&checksum, svn_checksum_md5, "", 0, iterpool));
SVN_TEST_ASSERT(svn_checksum_is_empty_checksum(checksum));
/* No 4 byte sequence shall ever produce the same MD5 as the empty input.
*/
for (i = 0; i <= APR_UINT32_MAX; ++i)
{
apr_uint32_t value = (apr_uint32_t)i;
svn_pool_clear(iterpool);
SVN_ERR(svn_checksum(&checksum, svn_checksum_md5, &value,
sizeof(value), iterpool));
SVN_TEST_ASSERT(!svn_checksum_is_empty_checksum(checksum));
}
return SVN_NO_ERROR;
}
]]]
Received on 2015-04-14 16:57:09 CEST