On Thu, Dec 2, 2010 at 4:21 PM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> Hi Johan.
> I've just read the whole of this thread.
> I didn't quite understand your original point (2) that "token-based
> suffix scanning will not be as fast as byte-based suffix scanning".
> Sure it won't, but is there any reason you mentioned suffix scanning
> there specifically? The same is true of prefix scanning, of course.
I mentioned suffix specifically because, for prefix scanning, the
byte-based approach doesn't have that same advantage. It has to count
the number of lines, so that the rest of the "normal" algorithm starts
with a correct line offset. That means byte-based prefix scanning needs
to check for eol characters just like the token-based version, so
theoretically (implementation details aside) both approaches should be
equally fast for prefix scanning.
But for suffix scanning, it seems I don't need to know the number of
lines, so the byte-based approach can simply ignore line structure for
the whole suffix.
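To make the asymmetry concrete, here is a minimal sketch (hypothetical
code, not the actual Subversion implementation) of the two byte-wise
scans: the prefix scan must check every byte against the eol characters
to keep a correct line count, while the suffix scan's inner loop needs
no eol checks at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scan identical leading bytes of a and b, counting '\n' so the caller
   knows how many whole lines were skipped. Returns the number of prefix
   bytes that form whole identical lines (stripped back to the last eol). */
static size_t
scan_prefix(const char *a, size_t alen,
            const char *b, size_t blen,
            size_t *lines_skipped)
{
  size_t i = 0, lines = 0, last_eol = 0;
  size_t n = alen < blen ? alen : blen;

  while (i < n && a[i] == b[i])
    {
      if (a[i] == '\n')        /* the eol check the prefix scan can't avoid */
        {
          lines++;
          last_eol = i + 1;
        }
      i++;
    }
  *lines_skipped = lines;
  return last_eol;
}

/* Scan identical trailing bytes; no eol bookkeeping needed at all. */
static size_t
scan_suffix(const char *a, size_t alen,
            const char *b, size_t blen)
{
  size_t i = 0;
  size_t n = alen < blen ? alen : blen;

  while (i < n && a[alen - 1 - i] == b[blen - 1 - i])
    i++;                       /* no comparisons against \r or \n here */
  return i;
}
```

(The real implementation would still have to strip the matched suffix
back to a line boundary afterwards, but that is a one-time fixup, not a
per-byte cost.)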
> And both of them could be fast enough, I assume, if you disable the hash
> calculation in the "get token" callbacks like you were talking about.
Granted, the difference could be minimal (I haven't actually tested).
However, an additional two comparisons (against \r and \n) for every
byte of a suffix of, say, 500Kb, over say 2000 diffs of a blame
operation, might cost another couple of seconds :-).
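A quick back-of-envelope check of that estimate, using the illustrative
(assumed, not measured) numbers from the mail:

```c
#include <assert.h>
#include <stdint.h>

/* Extra work done by token-based suffix scanning compared to the
   byte-based version: two eol comparisons (against \r and \n) per
   suffix byte. Inputs are the hypothetical numbers from the mail. */
static uint64_t
extra_comparisons(uint64_t suffix_bytes, uint64_t diffs)
{
  return 2 * suffix_bytes * diffs;   /* one check for \r, one for \n */
}
```

With a 500Kb suffix and 2000 diffs this gives roughly 2.0e9 extra
comparisons; at an assumed throughput on the order of 10^9 branchy
comparisons per second, that is indeed a couple of seconds.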
> But I don't think that necessarily affects the main point. It looks
> like you've thoroughly investigated using a token based approach. Thank
> you for doing so. My initial feeling that it was worth investigating
> was in the hope that you might find some fairly straightforward and
> self-contained modification to the existing token-handling layer. I
> think the result of this investigation, in which you needed to add
> token-fetch-backwards callbacks and so on, shows that this approach is
> too complex. I don't want to see a complex implementation. Therefore I
> support your inclination to abandon that approach and use the byte-wise
> approach instead.
Yep, it just seems too complex (as also confirmed by Bill Tutt in this
thread). The "pushback" stuff (which is inevitable, because we will
always read one token too far to discover the non-match, and that line
needs to be re-read to calculate the hash) is also rather
dirty/annoying (it adds extra complexity and state to the
token-handling layer).
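For illustration, here is a sketch (hypothetical names, not the actual
callback interface) of why the pushback is unavoidable in a token-based
suffix scan: the mismatch is only detected after the offending token has
already been fetched, so it has to be un-read for the normal algorithm
to process (and hash) it later.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token stream over an array of lines, read backwards. */
typedef struct {
  const char **lines;
  int pos;            /* index of the next line to hand out (from the end) */
} token_stream_t;

static const char *
get_token_backwards(token_stream_t *ts)
{
  return ts->pos >= 0 ? ts->lines[ts->pos--] : NULL;
}

/* Undo the last fetch, so the normal diff algorithm can re-read (and
   hash) the line that broke the matching suffix. */
static void
push_back(token_stream_t *ts)
{
  ts->pos++;
}

/* Count matching suffix lines of two streams. The loop necessarily
   reads one token too far on each side to detect the mismatch. */
static int
matching_suffix(token_stream_t *a, token_stream_t *b)
{
  int matched = 0;
  for (;;)
    {
      const char *ta = get_token_backwards(a);
      const char *tb = get_token_backwards(b);
      if (!ta || !tb || strcmp(ta, tb) != 0)
        {
          if (ta) push_back(a);   /* the dirty part: un-read the token */
          if (tb) push_back(b);
          return matched;
        }
      matched++;
    }
}
```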
It was definitely worth investigating, if only to give me some peace
of mind that we have considered the options (and it was a great
learning exercise in any case).
Received on 2010-12-02 21:05:47 CET