Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Thu, 23 Dec 2010 13:29:08 +0100

On Thu, Dec 23, 2010 at 1:05 PM, Julian Foad <julian.foad_at_wandisco.com> wrote:
> On Thu, 2010-12-23, Daniel Shahaf wrote:
>> Daniel Shahaf wrote on Thu, Dec 23, 2010 at 13:25:40 +0200:
>> > Johan Corveleyn wrote on Thu, Dec 23, 2010 at 01:51:08 +0100:
>> > > Yes, that's a good idea. I'll try to spend some time on that. But I'm
>> > > wondering about a good way to write such a script.
>> > >
>> > > I'd like the script to generate large files quickly, and with content
>> > > that's not totally random, but also not 1000000 times the exact same
>> > > line (neither of those is going to be representative of real-world
>> > > data, and either might hit some edge behavior of the diff algorithm).
>> >
> [...]
>> >
>> > > (maybe totally random is fine, but is there an easy/fast way to
>> > > generate this?)
>> > >
>> > > As a first attempt, I quickly hacked up a small shell script, writing
>> > > out lines in a for loop, one by one, with a fixed string together with
>> > > the line number (index of the iteration). But that's too slow (10000
>> > > lines of 70 bytes, i.e. 700 KB, is already taking 14 seconds).
>> > >
>> > > Maybe I can start with 10 or 20 different lines (or generate 100 in a
>> > > for loop), and then start doubling that until I have enough (cat
>> > > file.txt >> file.txt). That will probably be faster. And it might be
>> > > "real-worldish" enough (a single source file also contains many
>> > > identical lines, e.g. all lines with a single brace etc.).
>> > >
>> > > Other ideas? Maybe there is already something like this lying around?
>> > >
>> > > Another question: a shell script might not be good, because not
>> > > portable (and not fast)? Should I use python for this? Maybe the
>> > > "write line by line with a line number in a for loop" would be a lot
>> > > faster in Python? I don't know a lot of python, but it might be a good
>> > > opportunity to learn some ...
>> >
>> > IMO, use whatever language is most convenient for you to write the
>> > script in. (Generating the test data need not be fast since it's
>> > a once-only task.)
>>
>> That is: *in my opinion* it doesn't need to be fast. But re-reading
>> your mail, I gather you think otherwise.
>>
>> Why? I assumed you'd run the script once, generate a repository, then
>> (commit that repository to ^/tags somewhere for safekeeping and) work
>> with that repository thereafter without regenerating it each time; so
>> generating wouldn't need to be fast.
>
> That's OK if it's a private test but for a maintainable test it's much
> better to generate any large data set on the fly. Then we can easily
> tweak it to generate different data sizes, data with mismatching EOL
> style, data with prefix matching only/suffix only/both, etc. And if the
> test data size is in the order of a megabyte or more it's ugly to check
> it in as part of the test suite in the project repo (even if it is
> usually compressed in the repo and in transmission).

Yes, I wouldn't like to commit them (as you suggest, Julian, we might
want to generate different variants, of different sizes; I wouldn't
like to commit several 100 MB files, for instance).
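
To make the "different variants" idea concrete, here is a rough sketch
of generating a pair of files on the fly that match in their prefix
only, their suffix only, or both. It assumes Python; the function name
make_pair, its parameters and the inserted marker line are all made up
for illustration, not taken from any existing test script.

```python
def make_pair(block: bytes, repeats: int, change_at: int):
    """Return (original, modified) file contents.

    Both consist of `block` repeated `repeats` times; `modified` has one
    extra line inserted before copy number `change_at` (0-based).
      change_at == 0        -> only the suffixes match
      change_at == repeats  -> only the prefixes match
      anything in between   -> both prefix and suffix match
    """
    if not 0 <= change_at <= repeats:
        raise ValueError("change_at must be between 0 and repeats")
    blocks = [block] * repeats
    original = b"".join(blocks)
    # Hypothetical marker line; any line not present in `block` would do.
    extra = [b"changed line\n"]
    modified = b"".join(blocks[:change_at] + extra + blocks[change_at:])
    return original, modified
```

Calling make_pair(block, 1000, 500) would give two files that differ
in a single line in the middle; the sizes are tuned with `repeats`.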

Also, I wouldn't like to depend on a website or wikipedia dump or
something like that (or even just depend on an internet connection,
for that matter).

Taking just a bunch of real sources from svn's source tree also
doesn't feel "quite right". It's not reproducible, since the sources
change.

I'm currently thinking of embedding into the script (or committing
next to the script) a significant chunk of test data with some
"real-worldish" content (say 1 KB, or 10 or even 100 KB), and using that
to generate arbitrary-length files by repeating that block (or by
doubling the file (cat file >> file) until it's large enough).
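
A minimal sketch of that doubling approach, assuming Python. The names
grow_by_doubling, write_test_file and SEED are hypothetical, and the
tiny seed stands in for the 1-100 KB chunk described above. The
doubling is done in memory rather than with "cat file >> file", since
some cat implementations refuse to append a file to itself.

```python
def grow_by_doubling(seed: bytes, target_size: int) -> bytes:
    """Double `seed` until the result is at least `target_size` bytes."""
    if not seed:
        raise ValueError("seed must not be empty")
    data = seed
    while len(data) < target_size:
        data += data  # each pass doubles the length
    return data


# Hypothetical seed block; a real script would embed a larger chunk.
SEED = b"int main(void)\n{\n    return 0;\n}\n"


def write_test_file(path: str, target_size: int, seed: bytes = SEED) -> int:
    """Write at least `target_size` bytes to `path`; return bytes written."""
    content = grow_by_doubling(seed, target_size)
    with open(path, "wb") as f:  # binary mode keeps the EOLs as-is
        f.write(content)
    return len(content)
```

Something like write_test_file("big_file.txt", 100 * 1024 * 1024)
would then produce a file of at least 100 MB. The result overshoots
the target by up to a factor of two; truncating to a whole number of
seed blocks would be an easy refinement.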

I'm wondering a bit what "real-worldish" would mean, but
maybe it's not that terribly important. I could always include
multiple variations (a piece of C code, a chunk of "Lorem ipsum" text,
a large XML file, ...).

Cheers,

-- 
Johan

Received on 2010-12-23 13:30:06 CET
