Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Thu, 23 Dec 2010 12:05:54 +0000

On Thu, 2010-12-23, Daniel Shahaf wrote:
> Daniel Shahaf wrote on Thu, Dec 23, 2010 at 13:25:40 +0200:
> > Johan Corveleyn wrote on Thu, Dec 23, 2010 at 01:51:08 +0100:
> > > Yes, that's a good idea. I'll try to spend some time on that. But I'm
> > > wondering about a good way to write such a script.
> > >
> > > I'd like the script to generate large files quickly, with content
> > > that's not totally random, but also not 1000000 copies of the exact
> > > same line (neither of those is representative of real-world data,
> > > and either might hit some edge behavior of the diff algorithm).
> >
[...]
> >
> > > (maybe totally random is fine, but is there an easy/fast way to
> > > generate this?)
> > >
> > > As a first attempt, I quickly hacked up a small shell script, writing
> > > out lines in a for loop, one by one, with a fixed string together with
> > > the line number (index of the iteration). But that's too slow (10000
> > > lines of 70 bytes, i.e. 700 KB, already takes 14 seconds).
> > >
> > > Maybe I can start with 10 or 20 different lines (or generate 100 in a
> > > for loop), and then start doubling that until I have enough (cat
> > > file.txt >> file.txt). That will probably be faster. And it might be
> > > "real-worldish" enough (a single source file also contains many
> > > identical lines, e.g. all lines with a single brace etc.).
> > >
> > > Other ideas? Maybe there is already something like this lying around?
> > >
> > > Another question: a shell script might not be a good choice, because
> > > it's not portable (and not fast). Should I use Python for this? Maybe
> > > the "write line by line with a line number in a for loop" approach
> > > would be a lot faster in Python? I don't know a lot of Python, but it
> > > might be a good opportunity to learn some ...
> >
> > IMO, use whatever language is most convenient for you to write the
> > script in. (Generating the test data need not be fast since it's
> > a once-only task.)
>
> That is: *in my opinion* it doesn't need to be fast. But re-reading
> your mail, I gather you think otherwise.
>
> Why? I assumed you'd run the script once, generate a repository, then
> (commit that repository to ^/tags somewhere for safekeeping and) work
> with that repository thereafter without regenerating it each time; so
> generating wouldn't need to be fast.

That's OK if it's a private test, but for a maintainable test it's much
better to generate any large data set on the fly. Then we can easily
tweak it to generate different data sizes, data with mismatching EOL
style, data with prefix matching only/suffix only/both, etc. And if the
test data size is on the order of a megabyte or more, it's ugly to check
it in as part of the test suite in the project repo (even if it is
usually compressed in the repo and in transmission).
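
To make that concrete, here is a rough sketch of what such an on-the-fly
generator might look like in Python. It is only an illustration, not
existing test-suite code: the names make_lines and make_pair, their
parameters, and the choice of a 20-line pool are all hypothetical. Lines
are drawn from a small pool so the content is neither fully random nor
one line repeated, and a fixed seed keeps the output reproducible.

```python
import random


def make_lines(count, seed=0, pool_size=20):
    """Return `count` lines (without EOL) drawn from a small pool.

    A small pool of distinct lines repeated in pseudo-random order is
    assumed here to be "real-worldish" enough; adjust pool_size to taste.
    """
    rng = random.Random(seed)
    pool = ["line pattern %02d %s" % (i, "x" * rng.randint(10, 50))
            for i in range(pool_size)]
    return [rng.choice(pool) for _ in range(count)]


def make_pair(total=100000, prefix=0, suffix=0,
              eol="\n", other_eol=None, seed=0):
    """Return (original, modified) file contents as bytes.

    The first `prefix` and last `suffix` lines are identical in both
    files; every line in between is changed. If `other_eol` is given,
    the modified file uses it instead of `eol` (mismatching EOL style).
    """
    if prefix + suffix > total:
        raise ValueError("prefix + suffix must not exceed total")
    middle = total - prefix - suffix
    head = make_lines(prefix, seed)
    tail = make_lines(suffix, seed + 1)
    mid_a = make_lines(middle, seed + 2)
    mid_b = [line + " (changed)" for line in mid_a]

    def render(lines, line_end):
        # Join all lines in one go rather than writing them one by one.
        return "".join(line + line_end for line in lines).encode("ascii")

    return (render(head + mid_a + tail, eol),
            render(head + mid_b + tail, other_eol or eol))
```

A test could then call, say, make_pair(total=1000000, prefix=500000,
suffix=100000, other_eol="\r\n") and write the two results to temporary
files before running the diff, instead of checking large files in.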

- Julian
Received on 2010-12-23 13:06:36 CET
