
Re: Fwd: hash in subversions vdelta.c

From: Mikkel Fahnøe Jørgensen <mikkel_f_j_at_yahoo.dk>
Date: 2002-08-22 13:22:31 CEST

--- Branko Čibej <brane@xbc.nu> wrote:

> Mikkel Fahnøe Jørgensen wrote:

> Well, the hash table in vdelta.c is a bit special in
> that it expects
> conflicts, because it can hold duplicate keys. In
> fact, without that
> feature, the VDelta algorithm wouldn't work at all.
> :-)

There is nothing special about that. Any open hash
uses a link to deal with conflicts. VDelta is special
because some conflicts are intentional - these all
come from using the same key; you are just
implementing a multimap. You still want to reduce the
number of conflicts from the fingerprint hash when the
input keys are not identical, and that is accomplished
by using a good hash function. In your source you
could take the hash modulo a prime table size and
you'd get fewer undesired conflicts. That approach
makes it difficult to grow the hash - which is not a
problem with the mix hash, because it does not require
a prime. Apparently you do not grow the table at all -
this will break the hashtable's performance above a
certain fill level - unless you use windowed input.
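
As a rough sketch of the difference (the function names and sizes here
are mine, purely illustrative, not from your source):

    #include <cstddef>

    /* Prime-sized table: the division itself scatters weak fingerprints,
       but growing means picking a new prime and re-dividing every key. */
    inline size_t bucket_prime(unsigned long fingerprint, size_t prime_size)
    {
        return fingerprint % prime_size;
    }

    /* Power-of-two table with a strong mixing hash (like the Jenkins mix
       in the attached header): masking is cheap, and the table doubles
       trivially because the mix distributes the bits, not the divisor. */
    inline size_t bucket_pow2(unsigned long mixed_hash, size_t log2_size)
    {
        return mixed_hash & ((size_t(1) << log2_size) - 1);
    }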

> >If VDelta is using a 64K window (or whatever) you
> get
> >an efficient upper limit on the hashtable which
> >affects the choice of indexing method. In
> particular
> >you can oversize a hash table to reduce conflicts
> and
> >you do not need to grow the hash nor risk conflicts
> >with virtual memory.
> >
> >
> Take a look at the hash table implementation, and
> the commentary at the
> top of vdelta.c that describes it. It's not a
> general-purpose table, as
> you seem to expect.

See also above.

VDelta can work with windowed buffers even if your
implementation does not. If you look at the ZDelta
implementation, which is largely based on VDelta,
you'll find it uses a window. This both avoids
degrading hash performance and allows large files to
be processed in buffered chunks.

http://cis.poly.edu/zdelta/
http://cis.poly.edu/tr/tr-cis-2002-02.shtml
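
To illustrate the windowing idea (a rough sketch only - the 64K window
size and the encode_window routine are placeholders of my own, not
ZDelta's actual API):

    #include <cstdio>
    #include <vector>

    /* Placeholder for a vdelta-style per-window encoder. */
    static void encode_window(const char *buf, size_t len)
    {
        /* build the per-window hash table over buf[0..len) and
           emit copy/insert instructions ... */
        (void)buf; (void)len;
    }

    void encode_file(const char *path)
    {
        const size_t window_size = 1 << 16; /* 64K window */
        std::vector<char> buf(window_size);
        FILE *f = fopen(path, "rb");
        if(!f)
            return;
        size_t n;
        while((n = fread(&buf[0], 1, window_size, f)) > 0)
        {
            /* each window gets a fresh match table, so the hash never
               indexes more than one window and large files stream
               through in constant memory */
            encode_window(&buf[0], n);
        }
        fclose(f);
    }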

I have attached my IdHashTable.h. It handles multiple
keys, and it also maintains an ordered collection -
though that feature is not particularly relevant to
vdelta. These two functions iterate over exactly the
conflicts that are of interest to vdelta and skip
spurious conflicts:

    long GetIndexOf(long id)
    long GetNextIndexOf(long id, long index)

For performance reasons you'd specialize this code for
vdelta lookup, but the hash is well suited for the
purpose.
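
For example, a vdelta-style scan over all positions stored under one
fingerprint would look roughly like this (the fingerprint and the
stored offsets are illustrative):

    #include "IdHashTable.h"

    void find_matches(IdHashTable &table, long fingerprint)
    {
        /* visits every entry inserted under this id (the multimap case);
           spurious conflicts - other ids in the same bucket - are skipped */
        for(long i = table.GetIndexOf(fingerprint); i != -1;
            i = table.GetNextIndexOf(fingerprint, i))
        {
            long offset = table.GetDataAt(i); /* e.g. an offset stored by Insert */
            (void)offset; /* compare the input at 'offset' here ... */
        }
    }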

Mikkel


#ifndef IDHASHTABLE_H
#define IDHASHTABLE_H

#include <cstdlib> /* malloc, realloc, free */
#include <cassert> /* assert */

/*
   Based on Bob Jenkins' lookup2.c file, hash2 function,
   modified to specifically handle 32-bit keys,
   i.e. hash2 with length=1, initval=0.
   Note that according to Jenkins, the hash table size should be a power
   of two with the excess bits masked off. That is, no prime number is required.

   http://burtleburtle.net/bob/c/lookup2.c
   http://burtleburtle.net/bob/hash/hashfaq.html
*/

typedef unsigned long ub4; /* unsigned 4 byte value (32 bit) */
inline ub4 HashId(ub4 k)
{
    ub4 a,b,c;

    /* Set up the internal state */

    a = b = 0x9e3779b9; /* the golden ratio; an arbitrary value */
    c = 1;
    a += k;

    /* Bob Jenkins' mix(a,b,c) macro, inlined by hand */
    a -= b; a -= c; a ^= (c>>13);
    b -= c; b -= a; b ^= (a<<8);
    c -= a; c -= b; c ^= (b>>13);
    a -= b; a -= c; a ^= (c>>12);
    b -= c; b -= a; b ^= (a<<16);
    c -= a; c -= b; c ^= (b>>5);
    a -= b; a -= c; a ^= (c>>3);
    b -= c; b -= a; b ^= (a<<10);
    c -= a; c -= b; c ^= (b>>15);

    /*-------------------------------------------- report the result */
    return c;
}

/*
    HashTable for ids.
    Maintains an ordered collection based on insertion order.

    Rehashing occurs during Insert when the number of items exceeds 50% of the hash table.
    Shrinking occurs during Remove when less than 25% of the capacity is occupied.
    Complete deallocation only happens by calling Clear().
    No allocation takes place before the first insertion.
*/

class IdHashTable
{
    struct IdHashItem
    {
        ub4 id;
        ub4 hash;
        long data;
        long next;
    };

    IdHashItem *m_pHashItems;
    long *m_pHashBuckets;
    long m_lLog2HashSize, m_lInitialLog2HashSize;
    long m_lMask;
    long m_lCapacity;
    long m_lCount;
    long m_lMissingValue;

    /*
        Compact points past the end of the consecutive live elements in
        the items table - or to capacity if no elements have been removed.
        This supports an ordered collection where compaction is needed
        only when an element is referenced by an index above the
        compact marker.

        Free points to the consecutive free memory at the top of the items
        table. If it is at Capacity during insertion, more space can be
        acquired either by compacting (if possible) or by expanding the table.

        In summary, this implements a lazy index that is nearly cost-free
        when either no elements are removed, or no elements are referenced
        by index.
    */
    long m_lCompact;
    long m_lFree;

    enum { empty = -1, deleted = -2 };

    void compact()
    {

        if(m_lCompact == m_lCapacity)
            return;

        long i, offset;
        offset = 0;
        for(i = m_lCompact; i < m_lFree; ++i)
        {
            if(m_pHashItems[i].next == deleted)
            {
                ++offset;
                continue;
            }
            m_pHashItems[i - offset] = m_pHashItems[i];
        }
        m_lFree -= offset;
        m_lCompact = m_lCapacity;
        rehash();

    }

    void rehash()
    {
        if(m_pHashBuckets == 0)
            return;
        long i = 0;
        while(i <= m_lMask)
            m_pHashBuckets[i++] = empty;

        for(i = 0; i < m_lCount; ++i)
        {
            IdHashItem &item = m_pHashItems[i];
            long k = item.hash & m_lMask;
            item.next = m_pHashBuckets[k];
            m_pHashBuckets[k] = i;
        }
    }

    void expand()
    {
        // expand ensures there is at least one available entry at m_lFree

        compact();
        if(m_pHashBuckets)
            free(m_pHashBuckets);

        m_lLog2HashSize++;
        if(m_lLog2HashSize < 2)
            m_lLog2HashSize = 2;

        m_lMask = (1 << m_lLog2HashSize) - 1;
        m_pHashBuckets = (long*)malloc((m_lMask + 1) * sizeof(long));

        // Capacity = 50% of bucket table. Makes sense as bucket table is much smaller than item table.
        m_lCapacity = 1 << (m_lLog2HashSize - 1);
        m_lCompact = m_lCapacity;

        m_pHashItems = (IdHashItem*)realloc(m_pHashItems, m_lCapacity * sizeof(IdHashItem));

        rehash();
    }

    void shrink()
    {
        // "expand" to half the current size
        m_lLog2HashSize -= 2;
        expand();
    }

public:

    IdHashTable(long log2_hash_size = 4)
    {
        /*
            log2_hash_size is log2 of initial hash table size
        */
        m_lMissingValue = -1;
        m_pHashBuckets = 0;
        m_pHashItems = 0;
        m_lInitialLog2HashSize = log2_hash_size;
        Clear();
    }

    ~IdHashTable()
    {
        Clear();
    }

    /* The MissingValue is the value returned when the id or index is not found */
    void SetMissingValue(long value)
    {
        m_lMissingValue = value;
    }
    long GetMissingValue()
    {
        return m_lMissingValue;
    }

    /* clear doesn't affect the MissingValue */
    void Clear()
    {
        if(m_pHashBuckets)
        {
            free(m_pHashBuckets);
            m_pHashBuckets = 0;
        }
        if(m_pHashItems)
        {
            free(m_pHashItems);
            m_pHashItems = 0;
        }
        m_lLog2HashSize = m_lInitialLog2HashSize - 1;
        m_lCapacity = 0;
        m_lCount = 0;
        m_lCompact = m_lCapacity;
        m_lFree = 0;
        m_lMask = 0;
    }

    /*
        Same as Clear, except that it doesn't deallocate any memory.
     */
    void Reset()
    {
        m_lCount = 0;
        m_lCompact = m_lCapacity;
        m_lFree = 0;
        rehash();
    }

    /*
        Flags: multi, overwrite.

        Creates a new entry if the id doesn't exist, or if it exists and multi is true.
        If multi is false and the id exists, only the stored data is replaced.
        If multi is true, overwrite has no effect.
        The default setting is overwrite.
        If all flags are false, the hashtable is only modified if the id doesn't exist.

        With flags = multi, the hashtable works like a set of stacks.

        Return value:
            The data argument, or, with the 'return_existing' flag, the
            previously stored data (the missing value if the id was absent).

        'replace' is a combination of return_existing and overwrite.
     */
    enum { no_overwrite = 0, multi = 1, overwrite = 2, return_existing = 4, replace = 6 };
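
    /*
        Usage sketch (the ids and data values are hypothetical):

            IdHashTable t;
            t.Insert(7, 100);                        // new entry, returns 100
            t.Insert(7, 200);                        // default overwrite, returns 200
            t.Insert(7, 300, IdHashTable::multi);    // duplicate id, pushed on the chain
            t.Insert(7, 400, IdHashTable::replace);  // returns the old 300, stores 400
    */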

    long Insert(long id, long data, long flags = overwrite)
    {
        if(m_lFree == m_lCapacity)
        {
            /* unnecessary if inserting into existing value - but does no harm */
            if(m_lCount >= m_lCapacity / 2) // the = in the test handles special case m_lCapacity = 0
                expand();
            else
                compact();
        }

        ub4 hash = HashId((ub4)id);
        
        long x, x0;
        x = x0 = m_pHashBuckets[hash & m_lMask];

        if(!(flags & multi))
        {
            while(x != empty)
            {
                IdHashItem &item = m_pHashItems[x];
                if(item.id == id)
                {
                    long ret_val;
                    if(flags & return_existing)
                        ret_val = item.data;
                    else
                        ret_val = data;
                    if(flags & overwrite)
                    {
                        /* overwrite existing data */
                        item.data = data;
                    }
                    return ret_val;
                }
                x = item.next;
            }
        }
        IdHashItem &item = m_pHashItems[m_lFree];
        item.id = id, item.hash = hash, item.data = data, item.next = x0;
        m_pHashBuckets[hash & m_lMask] = m_lFree++;
        m_lCount++;
        if(flags & return_existing)
            return m_lMissingValue;
        return data;
    }

    /*
        If flags = multi, all occurrences will be removed,
        otherwise only the most recently inserted occurrence, if any.
        Returns the number of items removed.
     */
    long Remove(long id, long flags = 0)
    {
        if(m_lCount == 0)
            return 0;

        ub4 hash = HashId((ub4)id);
        long precount = m_lCount;
        long x, x0;
        long *link = &m_pHashBuckets[hash & m_lMask];
        x = *link;
        x0 = empty;
        while(x != empty)
        {
            IdHashItem &item = m_pHashItems[x];
            if(item.id == id)
            {
                *link = item.next;
                if(x < m_lCompact)
                    m_lCompact = x;

                // used during compaction
                item.next = deleted;

                m_lCount--;
                if(!(flags & multi))
                    break; /* done; the shrink check below still runs */
                x = *link;
                continue;
            }
            link = &item.next;
            x = *link;
        }

        if(m_lCount < m_lCapacity / 4)
            shrink();

        return precount - m_lCount;
    }

    bool Exists(long id)
    {
        if(m_lCount == 0)
            return false;

        ub4 k = HashId((ub4)id) & m_lMask;

        long x = m_pHashBuckets[k];

        while(x != empty)
        {
            IdHashItem &item = m_pHashItems[x];
            if(item.id == id)
                return true;
            x = item.next;
        }
        return false;
    }

    long Find(long id)
    {
        if(m_lCount == 0)
            return m_lMissingValue;

        ub4 k = HashId((ub4)id) & m_lMask;

        long x = m_pHashBuckets[k];

        while(x != empty)
        {
            IdHashItem &item = m_pHashItems[x];
            if(item.id == id)
                return item.data;
            x = item.next;
        }
        return m_lMissingValue;
    }

    /* Id doesn't need to be present */
    static long GetHashOf(long id)
    {
        return HashId(id);
    }

    /* Returns a value between 0 and Count() - 1, or -1 if not found. */
    long GetIndexOf(long id)
    {
        if(m_lCount == 0)
            return -1;

        ub4 k = HashId((ub4)id) & m_lMask;

        long x = m_pHashBuckets[k];

        while(x != empty)
        {
            IdHashItem &item = m_pHashItems[x];
            if(item.id == id)
            {
                if(x >= m_lCompact)
                {
                    compact();
                    return GetIndexOf(id);
                }
                return x;
            }
            x = item.next;
        }
        return -1;
    }

    long GetNextIndexOf(long id, long index)
    {
        if(index < 0)
            return -1;
        if(index >= m_lCount)
            return -1;

        if(index >= m_lCompact)
            compact();

        do
        {
            index = m_pHashItems[index].next;
            if(index >= m_lCompact)
            {
                compact();
                return GetNextIndexOf(id, index);
            }
        }
        while(index >= 0 && m_pHashItems[index].id != id);
        return index;
    }

    long Count()
    {
        return m_lCount;
    }

    /*
       The index is generally supposed to be valid,
       but GetDataAt(index) allows an invalid index, in which case it
       returns GetMissingValue().
       This makes the index symmetric to the Find operation,
       and functionally equivalent: Find(id) <=> GetDataAt(GetIndexOf(id)).
     */
    long GetDataAt(long index)
    {
        if(index < 0 || index >= m_lCount)
            return m_lMissingValue;
        if(index >= m_lCompact)
            compact();
        return m_pHashItems[index].data;
    }

    long GetIdAt(long index)
    {
        if(index >= m_lCompact)
            compact();
        assert(index >= 0);
        assert(index < m_lCount);
        return m_pHashItems[index].id;
    }
    void SetDataAt(long index, long data)
    {
        if(index >= m_lCompact)
            compact();
        assert(index >= 0);
        assert(index < m_lCount);
        m_pHashItems[index].data = data;
    }
};

#endif
