On 22.11.2012 17:01, Julian Foad wrote:
>> Author: brane
>> Modified: subversion/trunk/subversion/include/private/svn_string_private.h
>> +/**
>> + * Computes the similarity score STRA and STRB, given as the ratio
>> + * between the length of their longest common subsequence and the
>> + * length of the strings, normalized to the range [0..1000].
> "longest common non-contiguous subsequence"
Yes, we know it is, but that's not what the name of the algorithm is.
> "average length of the two strings"
This is true; however, saying "average" always sounds, to me, as if
we're dealing with a whole bunch of values, rather than just two.
> Just use 'float' and [0..1]? I'm thinking we may want to use this for comparing diff hunks, and there may be more than 500 characters in each hunk and it matters whether just one character has been changed.
Could do that, too. Harder to test though. :) It's trivial to extend the
range to [0..10**6] or even 10**9 if that's a problem. Frankly it feels
nice to keep the whole thing integral and not leave floating-point
messes around.
> [...]
>
>> Modified: subversion/trunk/subversion/tests/libsvn_subr/string-test.c
>> +static svn_error_t *
>> +test_string_similarity(apr_pool_t *pool)
>> +{
>> + const struct sim_score_test_t
>> + {
>> + const char *stra;
>> + const char *strb;
>> + apr_size_t lcs;
>> + int score;
>> + } tests[] =
>> + {
>> +#define SCORE(lcs, len) ((2000 * (lcs) + (len)/2) / (len))
>> +
>> + /* Equality */
>> + {"", "", 0, 1000},
>> + {"quoth", "quoth", 5, SCORE(5, 5+5)},
>> +
>> + /* Deletion at start */
>> + {"quoth", "uoth", 4, SCORE(4, 5+4)},
>> + {"uoth", "quoth", 4, SCORE(4, 4+5)},
>> +
>> + /* Deletion at end */
>> + {"quoth", "quot", 4, SCORE(4, 5+4)},
>> + {"quot", "quoth", 4, SCORE(4, 4+5)},
>> +
>> + /* Insertion at start */
>> + {"quoth", "Xquoth", 5, SCORE(5, 5+6)},
>> + {"Xquoth", "quoth", 5, SCORE(5, 6+5)},
>> +
>> + /* Insertion at end */
>> + {"quoth", "quothX", 5, SCORE(5, 5+6)},
>> + {"quothX", "quoth", 5, SCORE(5, 6+5)},
> Half of the examples in each of these four pairs are redundant.
> (The word "deletion" implies an ordering of string-A -> string-B, so
> by that terminology "uoth" -> "quoth" is an Insertion, same as
> "quoth" -> "Xquoth".)
Theoretically and academically speaking, yes, they're redundant.
Practically, however, this helped me find a bug that only popped up when
exactly one of these examples was removed. I love symmetric algorithms
but my bugs are always asymmetric and some of them break causality, too. :)
So in order to avoid falling into the incompleteness trap, I just
mirrored all the test cases. It's less of a problem to have duplicate
tests than it is to miss one.
-- Brane
--
Branko Čibej
Director of Subversion | WANdisco | www.wandisco.com
Received on 2012-11-22 18:30:32 CET