Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)

From: Gavin Beau Baumanis <gavinb_at_thespidernet.com>
Date: Thu, 23 Dec 2010 13:44:40 +1100

Hi Johan,

I was intrigued by your requirement to create a large file for testing.
I remember from a really long time ago when I learnt C, that we used a specific algorithm for creating "natural" and "random" text.
With some help from Mr.Google found out about Markov Chains that look promising - I can't remember if that was what I learned about or not - but it looks like it might be a prove helpful none the less.

A little further Googlng and I found this specific post on stackoverflow.
http://stackoverflow.com/questions/1037719/how-can-i-quickly-create-large-1gb-textbinary-files-with-natural-content

No Idea if it is going to help you specifically or not... but there are quite a few ideas in the comments;
* Obtain a copy of the first 100MB from wikipedia - for example.

Finally, if you happen to a large enough file already, could you not use "split" (unix) function to give you a specific sized file?

Actually - I was so intrigued by the "challenge" of this - I have had a think of it over lunch;

Could you not do this?
(pseudo -code)

Start with a known "chunk" - say the license file.
get length of license file - for example only = 125bytes
append the chunk to itself until such time as you have the required size.
write the file once.

startSize = length(knownFile);
int numberOfLoops = log2(desiredFileSize / knownFile) ;
newFile = knownFile

Loop for numberOfLoops
{
newFile = newFile +newFile
}

write newFile;

Gavin

On 23/12/2010, at 11:51 AM, Johan Corveleyn wrote:

> On Wed, Dec 22, 2010 at 11:50 AM, Philip Martin
> <philip.martin_at_wandisco.com> wrote:
>> Johan Corveleyn <jcorvel_at_gmail.com> writes:
>>
>>> On Mon, Dec 20, 2010 at 11:19 AM, Philip Martin
>>> <philip.martin_at_wandisco.com> wrote:
>>>> Johan Corveleyn <jcorvel_at_gmail.com> writes:
>>>>
>>>>> This makes the diff algorithm another 10% - 15%
>>>>> faster (granted, this was measured with my "extreme" testcase of a 1,5
>>>>> Mb file (60000 lines), of which most lines are identical
>>>>> prefix/suffix).
>>>>
>>>> Can you provide a test script? Or decribe the test more fully, please.
>>>
>>> Hmm, it's not easy to come up with a test script to test this "from
>>> scratch" (unless with testing diff directly, see below). I test it
>>> with a repository (a dump/load of an old version of our production
>>> repository) which contains this 60000 line xml file (1,5 Mb) with 2272
>>> revisions.
>>>
>>> I run blame on this file, over svnserve protocol on localhost (server
>>> running on same machine), with an svnserve built from Stefan^2's
>>> performance branch (with membuffer caching of full-texts, so server
>>> I/O is not the bottleneck). This gives me an easy way to call 2272
>>> times diff on this file, and measure it (with the help of some
>>> instrumentation code in blame.c, see attachment). And it's
>>> incidentally the actual use case I first started out wanting to
>>> optimize (blame for large files with many revisions).
>>
>> Testing with real-world data is important, perhaps even more important
>> than artificial test data, but some test data would be useful. If you
>> were to write a script to generate two test files of size 100MB, say,
>> then you could use the tools/diff/diff utility to run Subversion diff on
>> those two files. Or tools/diff/diff3 if it's a 3-way diff that matters.
>> The first run might involve disk IO, but on most machines the OS should
>> be able to cache the files and subsequent hot-cache runs should be a
>> good way to profile the diff code, assumming it is CPU limited.
>
> Yes, that's a good idea. I'll try to spend some time on that. But I'm
> wondering about a good way to write such a script.
>
> I'd like the script to generate large files quickly, and with content
> that's not totally random, but also not 1000000 times the exact same
> line (either of those are not going to be representative for real
> world data, might hit some edge behavior of the diff algorithm).
> (maybe totally random is fine, but is there an easy/fast way to
> generate this?)
>
> As a first attempt, I quickly hacked up a small shell script, writing
> out lines in a for loop, one by one, with a fixed string together with
> the line number (index of the iteration). But that's too slow (10000
> lines of 70 bytes, i.e. 700Kb, is already taking 14 seconds).
>
> Maybe I can start with 10 or 20 different lines (or generate 100 in a
> for loop), and then start doubling that until I have enough (cat
> file.txt >> file.txt). That will probably be faster. And it might be
> "real-worldish" enough (a single source file also contains many
> identical lines, e.g. all lines with a single brace etc.).
>
> Other ideas? Maybe there is already something like this lying around?
>
> Another question: a shell script might not be good, because not
> portable (and not fast)? Should I use python for this? Maybe the
> "write line by line with a line number in a for loop" would be a lot
> faster in Python? I don't know a lot of python, but it might be a good
> opportunity to learn some ...
>
> Are there any examples of such "manual test scripts" in svn? So I
> could have a look at the style, coding habits, ... maybe borrow some
> boilerplate code?
>
> Cheers,
> --
> Johan
Received on 2010-12-23 03:45:27 CET

This message: [ Message body ]
Next message: Peter Samuelson: "Re: Setting $HOME in the testsuite"
Previous message: Johan Corveleyn: "Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)"
In reply to: Johan Corveleyn: "Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)"
Next in thread: Johan Corveleyn: "Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)"
Reply: Johan Corveleyn: "Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)"
Reply: Stefan Fuhrmann: "Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]