
Re: Coding goals / requirements.

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Tue, 18 Jan 2011 10:35:00 +0100

Hi,

As the guy responsible for the quote that started this thread ([1]):

> Actually, what's a little bit troubling is that there are currently
> only 3 possible "file_len's", of which only 2 are used in practice:
> diff2 and diff3 (diff4 is not used in svn core, only in
> tools/diff/diff4). So, if we make a slow and a fast version, we could
> just as well make an XXX2 and an XXX3 version explicitly, and dispense
> with the need to pass arrays and array lengths. Hmmm... generic (but
> more complex) code vs. a couple of specialized versions.

I'd like to point out that I didn't mean to criticize the API choices;
I was just weighing the options. In fact, I came up with the initial
API choice myself (passing an array of "files" instead of passing 2, 3
or 4 individual files), because I wanted it to be generically useful.
In any case it's a new API in the libsvn_diff module, for making
prefix/suffix optimizations possible.

I don't want to hijack this high-level discussion and turn it technical
again, but some more detail on the thinking behind that API may provide
useful additional context ...

In fact, I started out with the "simple" form: passing 2 file
arguments for regular diff. But then I discovered there was also
something like diff3 (with 3 files) and diff4 (with 4 files; not used
in core SVN). So I simply thought to myself: theoretically this
optimization should work with N files, so why not just make it so. And
I changed the API to accept an array of files.
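
Roughly, the shape of that change looks like this (made-up, simplified
signatures just to illustrate; the real code deals with datasources and
batons):

    /* Fixed arity: a separate entry point per diff variant. */
    static svn_error_t *
    datasources_open2(struct file_info *file0, struct file_info *file1);

    /* Array form: the same entry point serves diff2, diff3 and diff4. */
    static svn_error_t *
    datasources_open(struct file_info file[], apr_size_t file_len);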

This made the code more complex, but in reality it's always the same
approach (so once you 'get' it, it's easy to follow):

Things like:

    while (*file0.curp == *file1.curp)
    {
      ...
    }

are replaced with something like this (AND-ing the N-1 comparisons of
file[0] against each of the other files):

    for (i = 1, is_match = TRUE; i < file_len; i++)
      is_match = is_match && *file[0].curp == *file[i].curp;
    while (is_match)
    {
      ...
      for (i = 1, is_match = TRUE; i < file_len; i++)
        is_match = is_match && *file[0].curp == *file[i].curp;
    }
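
Putting it together, a self-contained sketch of the generic prefix scan
could look like the following (a simplified struct, with the bounds
checks that the snippets above elide; this is an illustration, not the
actual patch):

    #include <stddef.h>

    struct file_info
    {
      const char *curp;   /* current byte in this file's buffer */
      const char *endp;   /* one past the last byte of the buffer */
    };

    /* Advance all FILE_LEN files (FILE_LEN >= 2) past their common
       prefix, leaving each curp at the first mismatching byte (or at
       endp if a file is exhausted). */
    static void
    find_identical_prefix(struct file_info file[], size_t file_len)
    {
      int is_match;
      size_t i;

      for (i = 1, is_match = 1; i < file_len; i++)
        is_match = is_match
                   && file[0].curp < file[0].endp
                   && file[i].curp < file[i].endp
                   && *file[0].curp == *file[i].curp;

      while (is_match)
      {
        /* All files still agree on the current byte: step forward. */
        for (i = 0; i < file_len; i++)
          file[i].curp++;

        /* Re-evaluate the AND of the N-1 pairwise comparisons. */
        for (i = 1, is_match = 1; i < file_len; i++)
          is_match = is_match
                     && file[0].curp < file[0].endp
                     && file[i].curp < file[i].endp
                     && *file[0].curp == *file[i].curp;
      }
    }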

So, to be very concrete, it's really the choice between:

(1) A generic set of functions that work for N files, which are
significantly more complex than the N==2 case:

    datasources_open(file[])
    find_identical_prefix(file[])
    find_identical_suffix(file[])

(2) A specialized set of functions for 2, 3 and 4 files (the _4
variants arguably optional, since diff4 is not used in svn core), which
are a lot simpler, but completely duplicate the higher-level logic:

    datasources_open2(file0, file1)
    find_identical_prefix2(file0, file1)
    find_identical_suffix2(file0, file1)

    datasources_open3(file0, file1, file2)
    find_identical_prefix3(file0, file1, file2)
    find_identical_suffix3(file0, file1, file2)

    datasources_open4(file0, file1, file2, file3)
    find_identical_prefix4(file0, file1, file2, file3)
    find_identical_suffix4(file0, file1, file2, file3)
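
For comparison, the specialized 2-file prefix scan collapses to a
single loop (same made-up struct as in the sketch above):

    static void
    find_identical_prefix2(struct file_info *file0,
                           struct file_info *file1)
    {
      /* One branch over two pointers; no flag, no inner loop. */
      while (file0->curp < file0->endp
             && file1->curp < file1->endp
             && *file0->curp == *file1->curp)
      {
        file0->curp++;
        file1->curp++;
      }
    }

The whole trade-off is basically visible in that inner loop.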

At the time, I felt that (1) was the right choice WRT the power/weight
balance, because it keeps the logic in one single place. But with
stefan2's proposed low-level optimizations of the algorithm, it's more
or less open for debate :). Those low-level optimizations (which are
definitely worth it) are much more difficult to write generically. And
apart from that, splitting the functions out into _2, _3 and _4
variants is a low-level optimization in itself.
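
To give an idea of the kind of low-level trick I mean (my own
illustration, not stefan2's actual suggestion): in the 2-file case it's
easy to compare a machine word at a time and only fall back to bytes at
the end, whereas doing the same over an array of N files gets hairy
fast:

    #include <string.h>   /* memcmp */

    #define WORDSIZE sizeof(unsigned long)

    static void
    find_identical_prefix2_words(struct file_info *file0,
                                 struct file_info *file1)
    {
      /* Gobble whole machine words while both buffers have one left
         and the words are equal... */
      while ((size_t)(file0->endp - file0->curp) >= WORDSIZE
             && (size_t)(file1->endp - file1->curp) >= WORDSIZE
             && memcmp(file0->curp, file1->curp, WORDSIZE) == 0)
      {
        file0->curp += WORDSIZE;
        file1->curp += WORDSIZE;
      }

      /* ...then finish the last (partial or mismatching) word
         byte by byte. */
      while (file0->curp < file0->endp
             && file1->curp < file1->endp
             && *file0->curp == *file1->curp)
      {
        file0->curp++;
        file1->curp++;
      }
    }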

Right now, I'm still trying to keep it generic, integrating stefan2's
suggestions in a generic way. For the moment I'm assuming that the
speedup coming from the splitting-out itself is not significant (I'll
have to test that), but I'm not so sure anymore. We'll see...

Anyway, all in all, I certainly don't feel like I've wasted my time,
because it was also a great learning experience for me (I'm new to C). And
being a perfectionist, I want it to be as good as possible WRT the
power/weight ratio (the most bang (speed/features/...) for the buck
(complexity/maintainability/readability)) :-).

Cheers,

-- 
Johan
[1] http://svn.haxx.se/dev/archive-2011-01/0241.shtml

On Tue, Jan 18, 2011 at 7:37 AM, Arwin Arni <arwin_at_collab.net> wrote:
> Hi All,
>
> It is a very interesting notion that Gavin throws at us. I think it is
> very important for an open-source project to maintain its code in a way
> that makes it easy for a new guy (like me) to make quick and meaningful
> changes. Most open-source projects with a large development community
> end up as a mess of perfectly working, yet highly unreadable code.
>
> However, as a pair of fresh eyes looking at the code, I can safely say that
> though being an open-source project, Subversion has managed to stay
> "readable". This can only be attributed to the hours of work spent on
> "over-engineering" the code (BTW, I personally don't think anything can be
> over-engineered. In my book, it is merely a synonym for perfection).
>
> There are parts of the code (in svn) that have been written with the notion
> of "getting things done". And these are the parts that I really find
> difficult to assimilate. I think time spent *earlier* to sanitize the code
> is time better spent than trying to read the mind of the original
> implementer at a *later* point in time.
>
> Regards,
> Arwin Arni
>
>
>
> On Tuesday 18 January 2011 07:26 AM, Gavin Beau Baumanis wrote:
>>
>> Hi Brane,
>>
>> I'm pretty sure the context of the quote is along the lines of;
>>
>> Poor design and implementation prove to be a burden in terms of
>> maintenance costs in the long run.
>> And instead of having bums on seats for (entirely) new development,
>> manpower is wasted on maintenance tasks because of poor design / lack
>> of a prototype, etc.
>>
>> I guess it is an implementation / coding practice question:
>> Would a developer's time not be better spent doing the "guts" of the
>> job first, and then, once the engineering is proven to be accurate /
>> reflective of the requirements, worrying about private / public API
>> requirements?
>>
>> Especially in an OSS project where resources are lean and transient, it is
>> "my" (perhaps naive) view that spending x hours on writing an API that might
>> not ever be used by another consumer is in fact x hours wasted that could
>> have been spent on a more worthwhile task.
>> When the requirement of a service to be consumed comes to bear, that is
>> the time to create an appropriate API.
>>
>> From my past experience, I have created many an API that has never
>> been used, purely because the design standard said an API was required,
>> though the engineering requirements of satisfying the task at hand negated
>> that requirement entirely.
>>
>> Again - I don't presume to know any better - and in fact I started the
>> thread because of a desire to hopefully learn from it.
>> I'm not trying to be deliberately argumentative - I am just a proponent of
>> a good debate that fleshes out better outcomes / knowledge - selfishly for
>> myself - and hopefully for others too.
>>
>> Gavin.
>>
>>
>>
>> On 18/01/2011, at 9:13 AM, Branko Čibej wrote:
>>
>>> On 17.01.2011 23:07, Gavin Beau Baumanis wrote:
>>>>
>>>> Hi Brane,
>>>> I certainly do take maintainability seriously.
>>>> What's that well-quoted figure?
>>>> Something like 80% of the cost of software development is spent in the
>>>> development phase?
>>>
>>> I believe it's "should be spent" rather than "is spent" ... in reality,
>>> I've yet to see a project that didn't incur horrendous maintenance costs
>>> as a result of shortcuts taken during development.
>>>
>>> -- Brane
>>>
>
>
Received on 2011-01-18 10:36:02 CET
