[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: svn commit: r1412425 - in /subversion/trunk/subversion: include/private/svn_string_private.h libsvn_subr/string.c tests/libsvn_subr/string-test.c

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: Thu, 22 Nov 2012 16:01:35 +0000 (GMT)

> Author: brane

> Date: Thu Nov 22 05:43:56 2012
> New Revision: 1412425
>
> URL: http://svn.apache.org/viewvc?rev=1412425&view=rev
> Log:
> Groundwork for issue #4261. Utility for calculating string similarity.
>
> * subversion/include/private/svn_string_private.h
>   (svn_cstring__similarity): New prototype.
> * subversion/libsvn_subr/string.c
>   (svn_cstring__similarity): Implement.
>
> * subversion/tests/libsvn_subr/string-test.c
>   (test_string_similarity): New test case.
>   (test_funcs): Register test_string_similarity.

> Modified: subversion/trunk/subversion/include/private/svn_string_private.h
> +/**
> + * Computes the similarity score STRA and STRB, given as the ratio
> + * between the length of their longest common subsequence and the
> + * length of the strings, normalized to the range [0..1000].

"longest common non-contiguous subsequence"

"average length of the two strings"

Just use 'float' and [0..1]?  I'm thinking we may want to use this for comparing diff hunks, and there may be more than 500 characters in each hunk and it matters whether just one character has been changed.

> + * The result is equivalent to Python's
> + *
> + *  difflib.SequenceMatcher.ratio
> + *
> + * Optionally sets *RLCS to the length of the longest common
> + * subsequence of STRA and STRB. Using BUFFER for temporary storage,
> + * requires memory proportional to the length of the shorter string.
[...]

> Modified: subversion/trunk/subversion/tests/libsvn_subr/string-test.c
> +static svn_error_t *
> +test_string_similarity(apr_pool_t *pool)
> +{
> +  const struct sim_score_test_t
> +  {
> +    const char *stra;
> +    const char *strb;
> +    apr_size_t lcs;
> +    int score;
> +  } tests[] =
> +      {
> +#define SCORE(lcs, len) ((2000 * (lcs) + (len)/2) / (len))
> +
> +        /* Equality */
> +        {"",      "",          0, 1000},
> +        {"quoth",  "quoth",    5, SCORE(5, 5+5)},
> +
> +        /* Deletion at start */
> +        {"quoth",  "uoth",      4, SCORE(4, 5+4)},
> +        {"uoth",  "quoth",    4, SCORE(4, 4+5)},
> +
> +        /* Deletion at end */
> +        {"quoth",  "quot",      4, SCORE(4, 5+4)},
> +        {"quot",  "quoth",    4, SCORE(4, 4+5)},
> +
> +        /* Insertion at start */
> +        {"quoth",  "Xquoth",    5, SCORE(5, 5+6)},
> +        {"Xquoth", "quoth",    5, SCORE(5, 6+5)},
> +
> +        /* Insertion at end */
> +        {"quoth",  "quothX",    5, SCORE(5, 5+6)},
> +        {"quothX", "quoth",    5, SCORE(5, 6+5)},

Half of the examples in each of these four pairs are redundant. 
(The word "deletion" implies an ordering of string-A -> string-B, so
 by that terminology "uoth" -> "quoth" is an Insertion, same as
"quoth" -> "Xquoth".)

> +        /* Insertion in middle */

(That is, "Insertion/deletion in middle".)

> +        {"quoth",  "quoXth",    5, SCORE(5, 5+6)},
> +        {"quoXth", "quoth",    5, SCORE(5, 6+5)},
> +
> +        /* Transposition at start */
> +        {"quoth",  "uqoth",    4, SCORE(4, 5+5)},
> +        {"uqoth",  "quoth",    4, SCORE(4, 5+5)},
> +
> +        /* Transposition at end */
> +        {"quoth",  "quoht",    4, SCORE(4, 5+5)},
> +        {"quoht",  "quoth",    4, SCORE(4, 5+5)},
> +
> +        /* Transposition in middle */
> +        {"quoth",  "qutoh",    4, SCORE(4, 5+5)},
> +        {"qutoh",  "quoth",    4, SCORE(4, 5+5)},

Same redundance in the three "transposition" pairs.

> +        /* Difference */
> +        {"quoth",  "raven",    0, SCORE(0, 5+5)},
> +        {"raven",  "quoth",    0, SCORE(0, 5+5)},

And this pair.

> +        {"x",      "",          0, SCORE(0, 1+0)},
> +        {"",      "x",        0, SCORE(0, 0+1)},
> +        {"",      "quoth",    0, SCORE(0, 0+5)},
> +        {"quoth",  "",          0, SCORE(0, 5+0)},
> +        {"quoth",  "the raven", 2, SCORE(2, 5+9)},
> +        {"the raven",  "quoth", 2, SCORE(2, 5+9)},

And this one.

> +        {NULL, NULL}
> +      };

- Julian
Received on 2012-11-22 17:02:12 CET

This is an archived mail posted to the Subversion Dev mailing list.