Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Thu, 23 Dec 2010 13:34:40 +0200

Daniel Shahaf wrote on Thu, Dec 23, 2010 at 13:25:40 +0200:
> Johan Corveleyn wrote on Thu, Dec 23, 2010 at 01:51:08 +0100:
> > On Wed, Dec 22, 2010 at 11:50 AM, Philip Martin
> > <philip.martin_at_wandisco.com> wrote:
> > > Johan Corveleyn <jcorvel_at_gmail.com> writes:
> > >
> > >> On Mon, Dec 20, 2010 at 11:19 AM, Philip Martin
> > >> <philip.martin_at_wandisco.com> wrote:
> > >>> Johan Corveleyn <jcorvel_at_gmail.com> writes:
> > >>>
> > >>>> This makes the diff algorithm another 10% - 15%
> > >>>> faster (granted, this was measured with my "extreme" testcase of a 1,5
> > >>>> Mb file (60000 lines), of which most lines are identical
> > >>>> prefix/suffix).
> > >>>
> > >>> Can you provide a test script? Or describe the test more fully, please.
> > >>
> > >> Hmm, it's not easy to come up with a test script to test this "from
> > >> scratch" (unless with testing diff directly, see below). I test it
> > >> with a repository (a dump/load of an old version of our production
> > >> repository) which contains this 60000 line xml file (1,5 Mb) with 2272
> > >> revisions.
> > >>
> > >> I run blame on this file, over svnserve protocol on localhost (server
> > >> running on same machine), with an svnserve built from Stefan^2's
> > >> performance branch (with membuffer caching of full-texts, so server
> > >> I/O is not the bottleneck). This gives me an easy way to call 2272
> > >> times diff on this file, and measure it (with the help of some
> > >> instrumentation code in blame.c, see attachment). And it's
> > >> incidentally the actual use case I first started out wanting to
> > >> optimize (blame for large files with many revisions).
> > >
> > > Testing with real-world data is important, perhaps even more important
> > > than artificial test data, but some test data would be useful. If you
> > > were to write a script to generate two test files of size 100MB, say,
> > > then you could use the tools/diff/diff utility to run Subversion diff on
> > > those two files. Or tools/diff/diff3 if it's a 3-way diff that matters.
> > > The first run might involve disk IO, but on most machines the OS should
> > > be able to cache the files and subsequent hot-cache runs should be a
> > > good way to profile the diff code, assuming it is CPU limited.
> >
> > Yes, that's a good idea. I'll try to spend some time on that. But I'm
> > wondering about a good way to write such a script.
> >
> > I'd like the script to generate large files quickly, and with content
> > that's not totally random, but also not 1000000 times the exact same
> > line (either of those are not going to be representative for real
> > world data, might hit some edge behavior of the diff algorithm).
>
> How about using
>
> cat subversion/libsvn_wc/*.c
>
> as your test file?
>

As to time:

t1/subversion% time cat */*c | wc -c
cat: tests/libsvn_wc: Is a directory
9484278
cat */*c 0.00s user 0.05s system 4% cpu 1.248 total
wc -c 0.00s user 0.01s system 0% cpu 1.243 total

(but I ran 'make' earlier, so it might not be a cold cache)
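
If you would rather do the concatenation from Python than from the
shell, a rough sketch follows. The names concat_files and timed_concat
are made up for illustration (nothing like them exists in the
Subversion tree), and I have not timed this against cat. It skips
directories, which is what cat complained about above.

```python
import glob
import os
import time


def concat_files(pattern):
    """Concatenate every regular file matching `pattern`, in sorted order."""
    chunks = []
    for path in sorted(glob.glob(pattern)):
        # A pattern like */*c can also match directories
        # (e.g. tests/libsvn_wc), so only regular files are read.
        if os.path.isfile(path):
            with open(path, "rb") as f:
                chunks.append(f.read())
    return b"".join(chunks)


def timed_concat(pattern):
    """Return (data, elapsed_seconds) for one concatenation run."""
    start = time.perf_counter()
    data = concat_files(pattern)
    return data, time.perf_counter() - start
```

Calling timed_concat("*/*c") from the subversion/ directory should be
roughly equivalent to the shell run above; len() of the returned data
corresponds to the wc -c figure.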

>
> > (maybe totally random is fine, but is there an easy/fast way to
> > generate this?)
> >
> > As a first attempt, I quickly hacked up a small shell script, writing
> > out lines in a for loop, one by one, with a fixed string together with
> > the line number (index of the iteration). But that's too slow (10000
> > lines of 70 bytes, i.e. 700Kb, is already taking 14 seconds).
> >
> > Maybe I can start with 10 or 20 different lines (or generate 100 in a
> > for loop), and then start doubling that until I have enough (cat
> > file.txt >> file.txt). That will probably be faster. And it might be
> > "real-worldish" enough (a single source file also contains many
> > identical lines, e.g. all lines with a single brace etc.).
> >
> > Other ideas? Maybe there is already something like this lying around?
> >
> > Another question: a shell script might not be good, because not
> > portable (and not fast)? Should I use python for this? Maybe the
> > "write line by line with a line number in a for loop" would be a lot
> > faster in Python? I don't know a lot of python, but it might be a good
> > opportunity to learn some ...
> >
>
> IMO, use whatever language is most convenient for you to write the
> script in. (Generating the test data need not be fast since it's
> a once-only task.)

That is: *in my opinion* it doesn't need to be fast. But re-reading
your mail, I gather you think otherwise.

Why? I assumed you'd run the script once, generate a repository, then
(commit that repository to ^/tags somewhere for safekeeping and) work
with that repository thereafter without regenerating it each time; so
generating wouldn't need to be fast.
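
If you do go the Python route, here is a minimal sketch of the two ideas
from your mail combined: write a modest number of numbered lines, then
double the block until it is big enough. All names in it
(make_seed_lines, grow_by_doubling, modify_middle, write_pair) and the
output file names are hypothetical, and the word list is arbitrary
filler. The second file differs from the first by one line in the
middle, so most of the content is identical prefix/suffix, as in your
testcase.

```python
import os
import random


def make_seed_lines(count, seed=0):
    """Return `count` lines: a line number plus some pseudo-random filler."""
    rng = random.Random(seed)  # fixed seed so the data is reproducible
    words = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
    lines = []
    for i in range(count):
        filler = " ".join(rng.choice(words) for _ in range(8))
        lines.append("line %06d: %s\n" % (i, filler))
    return lines


def grow_by_doubling(lines, min_size):
    """Double the block until it holds at least `min_size` characters.

    The content is pure ASCII, so characters and bytes are the same count.
    """
    data = "".join(lines)
    while len(data) < min_size:
        data += data
    return data


def modify_middle(data, marker="CHANGED "):
    """Return a copy of `data` with the middle line prefixed by `marker`."""
    lines = data.splitlines(keepends=True)
    mid = len(lines) // 2
    lines[mid] = marker + lines[mid]
    return "".join(lines)


def write_pair(directory, min_size=1500000):
    """Write test_old.txt and test_new.txt (hypothetical names) to `directory`."""
    original = grow_by_doubling(make_seed_lines(100), min_size)
    old_path = os.path.join(directory, "test_old.txt")
    new_path = os.path.join(directory, "test_new.txt")
    with open(old_path, "w") as f:
        f.write(original)
    with open(new_path, "w") as f:
        f.write(modify_middle(original))
    return old_path, new_path
```

write_pair(".") would leave two files in the current directory that
can be handed to tools/diff/diff. Because the block is doubled, the
line numbers repeat, so this is not a substitute for real-world data.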
Received on 2010-12-23 12:37:43 CET

