Re: Re: Caching text-size in the entries file

From: Erik Huelsmann <ehuels_at_gmail.com>
Date: 2006-11-09 22:16:45 CET

> > -----Original Message-----
> > From: Erik Huelsmann [mailto:ehuels@gmail.com]
> > Sent: Wednesday, 8 November 2006 11:44
> > To: Michael Haggerty
> > Cc: Peter Lundblad; SVN Dev
> > Subject: Re: Caching text-size in the entries file
> >
> >
> > On 11/7/06, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> > > Erik Huelsmann wrote:
> > > > On 11/6/06, Erik Huelsmann <ehuels@gmail.com> wrote:
> > > >> On 11/6/06, Peter Lundblad <plundblad@google.com> wrote:
> > > >> > Erik Huelsmann writes:
> > > >> > > Ah, we already have a speed-vs-correctness tradeoff in status:
> > > >> > > we use mtimes instead of filecomparison or hash calculation.
> > > >> > > BTW: I do feel
> > > >> > Yeah, but I think we need to be careful when making the
> > > >> > heuristic worse.
> > > >>
> > > >> Right, but what I hope to do is reduce the number of false negatives:
> > > >> I assert most files which are modified (but do not have their
> > > >> timestamp changed) won't have modified keywords or eols only. Rather:
> > > >> I think having modified eols-and-keyword-expansions-only is an edge
> > > >> case rather unlikely to happen. Whereas we have seen cases where
> > > >> timestamps were kept constant (making the changes undetected).
> > > >>
> > > >> Currently our algorithm doesn't know any false positives, but it has a
> > > >> chance for false negatives. I'd rather see that the other way around:
> > > >> false positives are correctable with an 'svn revert' or 'svn cleanup';
> > > >> false negatives don't have a cure.
> > >
> > > What if Erik's new algorithm is used to detect
> > > files-that-might-be-modified, then those files are double-checked using
> > > the more expensive algorithm? I assume that in most use cases, at most
> > > a small percentage of files are changed when 'svn stat' is run.
> > > Therefore this should give almost as large a speed win without any
> > > downsides.
> >
> > Well, that'll give us fewer false negatives, without the extra false
> > positives, but it will gain us no speedup: all files which are marked
> > 'maybe-changed' need to be detranslated to test eol- and keywords-only
> > changes. My point is that these are sufficiently edge case not to
> > require full detranslation on status: 'normally' only content changes
> > will have occurred.
> I don't know, but the way I understand it, only very few files would normally
> be de-translated.

Maybe, but maybe not: currently, every file whose mtime has changed
will be detranslated. In working copies which have existed for
several months, that may be a much larger number than the files which
actually contain changes.

> Therefore you would gain some speed.

When? What I'm proposing is that we introduce a large number of cases
where detranslation *isn't* required to call a file modified (which
currently *only* happens after detranslation).

> Also, aren't there
> actually three algorithms: Full text compare (after detranslation), mtimes and
> text-size?

No, there are 2: full file compare and text-size. The point is that
full file compare (currently the only method) will only be used when
the file looks changed by its mtime (i.e. the mtime differs from the
recorded one).

> There ought to be a way to combine these which results in fast and
> nearly (or fully?) error-free detection of changed files.

Full error-free detection is only possible with full file compare on
every file in the working copy. Having said that, the current
algorithm says that if a file has a changed mtime it *might* be
changed and needs a full compare.
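
To make the current behaviour concrete, here is a minimal sketch in
Python (not the actual libsvn_wc code). The `Entry` class, the function
names and the simplified `detranslate` (which only undoes CRLF eols and
ignores keywords) are assumptions made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for what is recorded per file."""
    mtime: int         # working-file timestamp recorded earlier
    base_text: bytes   # pristine text in repository-normal form (LF eols)


def detranslate(working: bytes) -> bytes:
    """Simplified detranslation: only undo CRLF eols (keywords ignored)."""
    return working.replace(b"\r\n", b"\n")


def is_modified_current(entry: Entry, mtime: int, working: bytes) -> bool:
    # Unchanged mtime: assume unmodified without reading the file.
    if mtime == entry.mtime:
        return False
    # Changed mtime: the file *might* be changed, so do the full compare.
    return detranslate(working) != entry.base_text


entry = Entry(mtime=100, base_text=b"hello\n")
touched = is_modified_current(entry, 200, b"hello\r\n")    # mtime changed, content not
edited = is_modified_current(entry, 200, b"goodbye\r\n")   # real edit, new mtime
missed = is_modified_current(entry, 100, b"goodbye\r\n")   # real edit, mtime preserved
```

Here `touched` is False and `edited` is True, both only after a full
compare. `missed` is False as well: that is the false negative, since
the edit kept the old timestamp and the file was never read.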

The new algorithm which Peter Lundblad proposes is to *also* require a
full file compare if the file size changed. Instead of spending less
time in status, we will now spend more time in status (and err less
often on the side of false negatives!).
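
A rough sketch of why that variant can only add full compares, never
remove them. The function names and the sample tuples below are
invented for illustration; sizes are the cached and actual working-file
sizes in bytes:

```python
def needs_compare_current(rec_mtime: int, mtime: int) -> bool:
    # Current rule: full compare only when the mtime differs.
    return mtime != rec_mtime


def needs_compare_with_size(rec_mtime: int, rec_size: int,
                            mtime: int, size: int) -> bool:
    # Variant: full compare when the mtime *or* the size differs.
    return mtime != rec_mtime or size != rec_size


# (recorded mtime, recorded size, actual mtime, actual size)
files = [
    (100, 10, 100, 10),  # untouched
    (100, 10, 200, 10),  # touched, same size
    (100, 10, 100, 14),  # edited with the timestamp preserved
]

current = sum(needs_compare_current(rm, m) for rm, _, m, _ in files)
with_size = sum(needs_compare_with_size(*f) for f in files)
```

With these three sample files `current` is 1 and `with_size` is 2: the
extra compare is exactly the file whose edit kept its old timestamp.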

My proposal rests on the observation that a full file compare on files
which have a changed file size only serves to filter out those cases
where people have edited keyword expansions or changed eols from CRLF
to LF or vice versa: extreme edge cases not worth the extra cost in the
normal use case. So a changed size alone should be enough to call the
file modified.
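
As a sketch of the decision order I have in mind (my reading of the
proposal, not committed code; `Verdict` and `classify` are names made
up for this mail):

```python
from enum import Enum


class Verdict(Enum):
    UNMODIFIED = "unmodified"
    MODIFIED = "modified"
    NEEDS_COMPARE = "needs full compare"


def classify(rec_mtime: int, rec_size: int, mtime: int, size: int) -> Verdict:
    # A changed size is taken as proof of modification: no detranslation.
    if size != rec_size:
        return Verdict.MODIFIED
    # Same size but changed mtime: fall back to the full file compare.
    if mtime != rec_mtime:
        return Verdict.NEEDS_COMPARE
    # Same size and same mtime: assume unmodified.
    return Verdict.UNMODIFIED


crlf = b"a\r\nb\r\n"   # working file as recorded, CRLF eols
lf = b"a\nb\n"         # same text after someone converted the eols to LF

eol_only = classify(100, len(crlf), 200, len(lf))
kept_mtime = classify(100, len(crlf), 100, len(crlf) + 3)
touched = classify(100, len(crlf), 200, len(crlf))
```

`eol_only` comes out as MODIFIED although the detranslated text is
identical: that is the false positive I'm willing to accept.
`kept_mtime` is MODIFIED too, which the current algorithm would miss,
and `touched` still gets the full compare.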

Hope that makes my reasoning (and the problem) clearer.

bye,

Erik.

Received on Thu Nov 9 22:17:35 2006
