Re: Using md5sum for svn status

From: Mark Phippard <MarkP_at_softlanding.com>
Date: 2005-04-27 15:21:39 CEST

Philip Martin <philip@codematters.co.uk> wrote on 04/26/2005 10:16:04 PM:

> Mark Phippard <MarkP@softlanding.com> writes:
>
> >> To discover that there is some difference between two files, you must
> >> read, on average, halfway through both files -- once you encounter a
> >> mismatch, you can stop. (This is why Unix 'diff' and 'cmp' are not the
> >> same thing.)
> >
> > Your answer seems to assume that once we reach this case that finding
> > differences is what we would expect. Is that really the case?
>
> In some cases it's true, in others it's false.
>
> > If the
> > timestamps differ, but the sizes are the same I would expect that more
> > often than not something modified the timestamp but the files are
> > still the same.
>
> Are you aware that "broken" timestamps get "fixed" by operations which
> both take a write lock and check timestamps? That means that anyone
> using commit, cleanup or revert will generally not have broken
> timestamps.

Sure, although cleanup did not do this until 1.2. Prior to that change,
if just your timestamps were off, then commit and revert could not help,
because ultimately the byte-by-byte compare would reveal that the files
were the same, leaving nothing to commit or revert. Really, the only
thing you could do was a fresh checkout.

Ideally, when this compare happens, and the files are the same, the
timestamps could be fixed. I seem to recall from past discussions that
this would be problematic to try and do. I think the problem is that the user
does not necessarily know that they have this problem, and that running
cleanup could resolve it. They just think that Subversion is slow. The
use-commit-times bug compounds this problem since it creates this
situation.

My only point was that the current algorithm optimizes for the scenario
where the files are the same number of bytes, but otherwise do have
differences that are going to be revealed. We ought to take a step back
and try and decide if we think that scenario is more likely than the one
where there are no differences and consider if other algorithms might be
better suited. Of course someone would first have to prove that doing
something like comparing hashes is significantly faster (at least in some
cases) than doing a byte by byte compare.
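
To make the trade-off concrete, here is a minimal Python sketch of the two
approaches. It is not Subversion's code; the function names, the chunk size,
and the idea that an MD5 of the pristine copy is already recorded somewhere
are all assumptions made for illustration.

```python
import hashlib
import os

CHUNK = 64 * 1024  # arbitrary read size, chosen only for this sketch


def same_bytes(path_a, path_b, chunk=CHUNK):
    """Byte-by-byte compare: reads both files, stops at the first mismatch."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            if a != b:
                return False  # early exit: the rest is never read
            if not a:
                return True   # both files ended without a mismatch


def md5_of(path, chunk=CHUNK):
    """MD5 of one file; always reads the whole file."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def same_as_recorded(path, recorded_md5):
    """Hash compare against a checksum assumed to be recorded already."""
    return md5_of(path) == recorded_md5
```

When the files really are identical, same_bytes() has to read both of them
to the end, while same_as_recorded() reads only the working file. When they
differ early, same_bytes() stops early and the hash still reads everything.
Which one wins in practice would have to be measured.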

Hopefully the use-commit-times bug was the main source of this problem and
once 1.2 gains prevalence in the user base this issue will mostly just
fade away.

Mark

Received on Wed Apr 27 15:23:44 2005

This is an archived mail posted to the Subversion Dev mailing list.