On Thu, 2005-02-10 at 11:23 +0100, Branko Čibej wrote:
> Daniel Berlin wrote:
> >On Thu, 2005-02-10 at 03:57 +0100, Branko Čibej wrote:
> >>Daniel Berlin wrote:
> >>>1. For a simple byte level blame, we should be able to get the answers
> >>>from the diff format alone, without actually expanding it. You can
> >>>simply mark the affected byte ranges, and stop when the whole file has
> >>>been affected. What the output looks like is up for grabs.
> >>>2. You could actually do a similar thing with line level blame.
> >>>Discover the line byte ranges of the current revision, throw them into
> >>>an interval tree, and then do queries to see which lines a given diff
> >>>affects, based on its byte ranges. Stop when every line has been
> >>>affected.
> >>>2 seems a bit harder than 1.
> >>>I've been out of it for a while, so maybe there is something about
> >>>implementing #1 I have missed.
> >>>Have I missed anything about #1 that makes it hard?
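[A minimal sketch of the interval idea in #2, assuming a hypothetical byte-range
representation of diff-affected regions; `line_starts` and `lines_touched` are
illustrative names, not anything in the Subversion tree. It uses sorted line-start
offsets plus binary search instead of a full interval tree, which is enough for
point/range queries against fixed line boundaries:]

```python
import bisect

def line_starts(text: bytes):
    # Byte offset at which each line of the current revision begins.
    starts = [0]
    for i, b in enumerate(text):
        if b == 0x0A:  # '\n'
            starts.append(i + 1)
    return starts

def lines_touched(starts, lo, hi):
    # 0-based lines overlapped by the diff-affected byte range [lo, hi).
    first = bisect.bisect_right(starts, lo) - 1
    last = bisect.bisect_right(starts, hi - 1) - 1
    return range(first, last + 1)

text = b"one\ntwo\nthree\n"
starts = line_starts(text)
# Byte range [2, 6) spans the tail of line 0 and the head of line 1.
assert list(lines_touched(starts, 2, 6)) == [0, 1]
```

[Blame would walk revisions newest-to-oldest, marking each touched line with the
first revision that affected it, and stop once every line is marked.]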
> >>If by "diff format" you mean the binary delta, then... There was a time
> >>when I thought this would be possible. Now I'm not so sure. The trouble
> >>is that the vdelta doesn't generate an edit stream, it generates a
> >>compressed block-copy stream. Which means you can never be 100% sure,
> >>just from looking at the delta, which bytes in the target are really new
> >>and which are just (offset) copies from the source. The only blocks you
> >>can really be sure about are those that are represented by new data in
> >>the delta (either NEW blocks or target copies that take data from NEW
> >>blocks). This problem is made worse by our use of skip deltas in (at
> >>least) the BDB back-end.
> >>I agree that it would be nice if the server could generate some sort of
> >>byte-range oriented info, but I don't think it can be done just by
> >>looking at the deltas. It's sad, I know...
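[To make the "only NEW-derived bytes are certain" point concrete: here is a toy
provenance walk over a simplified op stream. The `('src', off, len)` /
`('new', data)` / `('tgt', off, len)` tuples are a stand-in invented for this
sketch, not the real svn_txdelta op encoding; it only illustrates that source
and target copies never prove a byte is new:]

```python
def definitely_new_ranges(ops):
    # Per-target-byte provenance: True = provably from NEW data.
    prov = []
    for op in ops:
        if op[0] == 'new':
            prov.extend([True] * len(op[1]))
        elif op[0] == 'src':
            # Copied from the source: not new as far as the delta can tell.
            prov.extend([False] * op[2])
        else:
            # Target copy: inherits provenance byte-by-byte, since it may
            # overlap the bytes it is producing (run-length style).
            off, n = op[1], op[2]
            for i in range(n):
                prov.append(prov[off + i])
    # Collapse the boolean vector into [lo, hi) ranges of certainly-new bytes.
    out, start = [], None
    for i, p in enumerate(prov):
        if p and start is None:
            start = i
        elif not p and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(prov)))
    return out

# Source copy, two NEW bytes, then a target copy of those NEW bytes.
assert definitely_new_ranges([('src', 0, 4), ('new', b'ab'), ('tgt', 4, 2)]) == [(4, 8)]
```

[Everything the walk marks False could still be textually new in the file; the
delta just cannot prove it, which is the problem described above.]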
> >After looking at the real slowdown, I'm not actually worried about this
> >so much anymore.
> >For now it's actually not the main source of slowdown for blame.
> >See my followup mail on the delta combiner.
> >The delta combiner falls down badly when you have even a moderate
> >number of target ops in the window. It goes n^2 and takes up a large
> >amount of the time.
> You probably mean "target ops in the first window" or "source ops in the
> second window", yes?
> I think you're right... every single bit in the combiner is linear or
> logarithmic, except the tiny little recursion in copy_source_ops which
> can, indeed, go quadratic with the right mix of data. The problem seems
> to be that the right mix of data is far too easily achievable, at least
> with vdelta...
This is why I chose xdelta, and only ran it over the source, because
it's guaranteed not to hit this problem when combining happens (and
combining falls completely off the profile).
It also has another interesting nice property:
You are more or less guaranteed that all the same portions of the source
that exist in target will be represented by copies, and that all
different portions will be represented by inserts.
This is because it works like so (it's early, so any parts that don't
make sense here should be read in connection with the code :P):
1. Runs through the source with a rolling checksum, checksumming every
separate <blocksize> bytes and creating a match table.
2. Hunts for matching starting points in the target.
3. For every byte in the target that doesn't match a starting point in
the source, create an insert op for that byte (we rely on the fact that
the inserter knows to combine consecutive insert ops to make a single
long insert op out of this :P)
4. For every byte in the target that does match a starting point in the
source, extend the match as long as possible and create a copy from the
source.
This more or less means that you shouldn't end up with inserts unless
something was really different, because if it was the same, the prior
copy would have been extendable. This is different from vdelta, where
it was just generating a compressed block copy. This is a real binary
"edit script" that produces the second from the first.
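[The four steps above can be sketched like this. It is a toy, not the actual
xdelta code in Subversion: it uses raw aligned source blocks in a dict where the
real thing uses a rolling checksum over the match table, and it scans the target
one byte at a time on a miss. `BLOCKSIZE` and the op tuples are made up for the
sketch:]

```python
BLOCKSIZE = 4

def xdelta_sketch(source: bytes, target: bytes):
    # Step 1: index every aligned BLOCKSIZE chunk of the source.
    table = {}
    for off in range(0, len(source) - BLOCKSIZE + 1, BLOCKSIZE):
        table.setdefault(source[off:off + BLOCKSIZE], off)
    ops = []
    i = 0
    while i < len(target):
        # Step 2: hunt for a matching starting point at this target offset.
        src_off = table.get(target[i:i + BLOCKSIZE])
        if src_off is None:
            # Step 3: no match -> one-byte insert; consecutive inserts coalesce.
            if ops and ops[-1][0] == 'insert':
                ops[-1] = ('insert', ops[-1][1] + target[i:i + 1])
            else:
                ops.append(('insert', target[i:i + 1]))
            i += 1
        else:
            # Step 4: extend the match as long as possible, emit a source copy.
            n = BLOCKSIZE
            while (i + n < len(target) and src_off + n < len(source)
                   and target[i + n] == source[src_off + n]):
                n += 1
            ops.append(('copy', src_off, n))
            i += n
    return ops

assert xdelta_sketch(b"abcdefgh", b"abcdXefgh") == \
    [('copy', 0, 4), ('insert', b'X'), ('copy', 4, 4)]
```

[Note how the one genuinely new byte comes out as the only insert, while
everything shared with the source comes out as copies, which is the "edit
script" property being claimed.]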
> /me bangs head against wall
> Problem is I have no idea how to get rid of that recursion in
> copy_source_ops and still combine deltas correctly.
I'm pretty sure you can't, unfortunately. The current code is
theoretically sound :P.
> The more I look at your numbers, the more it seems that xdelta is a
> better choice. The worse compression seems to be balanced by better
> performance, which makes sense. Unless I've done something really stupid
> in vdelta or the combiner, we should probably look at switching to xdelta
> in 1.2.
> -- Brane
Received on Thu Feb 10 16:40:08 2005