[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

RE: Diff lib

From: Sander Striker <striker_at_apache.org>
Date: 2003-01-09 00:59:32 CET

Hi,

More progress on this front.

Currently we have an implementation of two algorithms mixed
together:

"An O(NP) Sequence Comparison Algorithm"
by Wu, Manber and Meyers

"An O(ND) Difference Algorithm and Its Variations"
by Meyers.

Specifically variation 4b from the second paper is used and partially
implemented. It doesn't implement the linear space LCS recovery algorithm,
but does take advantage of scanning from both the start aswell as the end
of sequences being compared.

This kind of brings the longest common subsequence part of the code
up to speed. It can always be improved upon later on. I've been told
GNU diff uses some heuristic to speed things up, but I refuse to look
at their code because of the license issues. If someone could explain it
to me though, I would probably be all ears ;).

Furthermore there are some things on the TODO list which I
will echo here. I guess I could put it in the issue, but hey,
I'm already halfway through this post... Half of these are implementation
details:

 - The tree to hold the nodes should be replaced by something smarter.
   At a minimum it should be balanced, at best it should be fronted by
   a hash table.

 - We should be taking advantage of the knowledge that some tokens (lines)
   don't occur at all in another sequence. We should just skip these.
   If we don't we might aswell do away with the node tree and do comparison
   on the tokens (and/or their hashes) directly.

 - The public API should grow a hash * argument to return hashes in
   get_token().

 - Several cleanups need to be applied:
    - svn_diff__tree_insert_token() should return a svn_diff__node_t instead
      of a svn_diff__position_t.
    - svn_diff__position_t allocation should be isolated in svn_diff__get_tokens.
      Currently it is in svn_diff__tree_insert_token which is Wrong(tm).

 - The diff3 code is missing usefull output functions (see: diff_file.c).

 - We are in dire need for some tests of this library. The only real
   testing I've done are through gdb and 'diff', which didn't change since
   I last posted it:
     http://subversion.tigris.org/servlets/ReadMsg?list=dev&msgNo=13857.
   The way I use it is as follows:
     ./diff original modified >original.patch
     cp original modified2
     cat orignal.patch | patch modified2
     diff modified modified2 (should report no differences).

   The diff3 code is completely untested...

 - Some timed tests need to be done.

Just a heads up,

Sander

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Thu Jan 9 01:00:18 2003

This is an archived mail posted to the Subversion Dev mailing list.