Re: [patch] First stab at text deltas

From: Greg Hudson <ghudson_at_MIT.EDU>
Date: 2000-08-11 17:30:42 CEST

When you get back, I have a number of comments on vdelta.c.

The algorithm itself looks correct to me, although it could generate
short copies (less than four bytes) in the edge cases. The code might
come out a little simpler and not generate short copies if, instead of
having a single processing run with a source/target barrier, you have
one processing run for the source data and another one for the target
data (both runs using the same hash table, of course). That was my
experience with my mockup, although I wasn't handling the detail of
combining insert instructions.

Your instruction generation interface isn't necessarily a great fit
for vcdiff. vcdiff doesn't distinguish between "copy from source" and
"copy from target"; it just has "copy from the abstract string which
is the concatenation of the source and the target."

I feel like your hash table could be simplified and improved.
(Dramatically, if we don't have to deal with hash collisions, but I'll
assume that we do for now.) First and least importantly:

        typedef struct vd_table_t {
          size_t count; /* Number of elements in the table. */
          size_t capacity; /* Table capacity. */
          vd_slot_t data[1]; /* Slot array. data[0] is the head of
                                           the free-slot DL list. */
        } vd_table_t;

This trick isn't exactly kosher C, since it involves pointers past the
end of an array object. Are we okay with it for subversion? Making
data a pointer isn't noticeably more expensive.

Your hash table appears to use a variant on linear hashing which
limits insertion time to a constant (using a doubly-linked list of
free slots). As the table gets close to full, linear hashing is going
produce a large number of collisions, which will tend to degrade
search performance.

I would suggest we use separate chains at each hash slot, assuming we
have to deal with collisions. Since you seem reluctant to allocate
memory for the hash table in the middle of the run, separate the pool
of entries from the hash table itself, yielding a structure like:

        typedef struct vd_entry_t vd_entry_t;
        struct vd_entry_t {
          char const *key; /* Match candidate */
          vd_entry_t *next; /* Next entry in the chain */
        }

        /* Hash table. */
        typedef struct vd_table_t {
          size_t count; /* Number of elements in the table */
          size_t hashsize; /* Hash array size */
          size_t capacity; /* Hash table capacity */
          vd_entry_t **hash; /* Lookup table into hash entries */
          vd_entry_t *pool; /* Pool of chained hash entries */
        } vd_table_t;

where hash will point to hashsize "vd_entry_t *"s and pool will point
to capacity "vd_entry_t"s. When you insert entries, assign entries
sequentially from the pool (using "count") and chain them into the
singly linked list headed by table->hash[hashval]. Since we never
delete entries, there's no need to do any complicated management of
the pool, although we could maintain a singly-linked list of free pool
entries using the next pointer if we turned out to need deletion in
the future.

Also, your insertion function doesn't need a flag or whether to
replace an old equivalent entry. In the case where you call the
function with the flag set to false, replacement is impossible (since
we just looked up the entry in the hash table). So it's always okay
to just replace the entry.

I realize your hash table and management routines are a bit more
generic whereas the ones I visualize only supports the particular
operations needed by vdelta. I think in this case we want a tailored
data structure with only the stuff we need.

[The useful content of this mail pretty much ends here. The rest
devolves into a fine point of coding style.]

I ran into a style difference with the way you declare pointer
variables and the way the rest of the subversion code appears to. You
write:

char const* key;

which I've seen a lot from programmers who wish C had a sane type
declaration syntax; "*" is, after all, part of the type, not the
variable. However, C doesn't work that way, and if you or any naive
future programmer changes that line to:

char const* key, value;

that convention will fall down; you'll have "key" which is a const
char * and "value" which is a const char. Even if you always declare
only one variable at a time, a future programmer might not realize
that constraint.

Anyway, regardless of the merits, the subversion code should be
stylistically consistent.
Received on Sat Oct 21 14:36:06 2006

This message: [ Message body ]
Next message: Greg Hudson: "Re: CVS update: subversion/notes webdav-usage.html"
Previous message: Jim Blandy: "Re: Berkeley DB in our source tree?"
In reply to: Branko Èibej: "Re: [patch] First stab at text deltas"
Next in thread: Daniel Stenberg: "Re: [patch] First stab at text deltas"
Reply: Daniel Stenberg: "Re: [patch] First stab at text deltas"
Reply: Branko Èibej: "Re: [patch] First stab at text deltas"

Contemporary messages sorted: [ By Date ] [ By Thread ] [ By Subject ] [ By Author ] [ By messages with attachments ]