[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: Using svn_hash__make instead of apr_hash__make

From: Justin Erenkrantz <justin_at_erenkrantz.com>
Date: Tue, 22 May 2012 17:30:26 -0700

On Tue, May 22, 2012 at 12:35 PM, Stefan Fuhrmann <eqfox_at_web.de> wrote:
>> Having a stable hash function sure doesn't seem like this would
>> account for that reduction.  Can you please elaborate?
>
>
> Subversion up to and including 1.7 will serialize directories
> as string->string hashes in FSFS. wordpress.org uses projects
> as the top-level of its repository (just like Apache). So, every
> commit writes a new version of that. At >26k projects, that's
>>1.4MB per revision.
>
> In 1.8, one may activate directory deltification. After serialization,
> the resulting text will be deltified just like any other node and
> the result be zip-compressed. Many revisions are now about 2KB.
> However, that hinges on successive versions of the directories
> to produce serialized text. Even a random "shift" by a larger
> number of entries will leave no 64 byte matches (our xdelta
> granularity) within the 100k text windows used by xdelta.

Thanks for the explanation. However, this is a very specific use case
- why can't we just use an explicitly sorted data structure here? I
know Greg coded up a red-black tree for pocore...I'm sure there are
plenty of guaranteed-sorted data structures under an ALv2 license. =)

> Well, I am a developer and reproducibility between test runs
> *does* matter to me.

If ordering matters, then the presentation layer should enforce
ordering. I would never expect a hash table to produce sorted
ordering across runs. (In fact, if it does, it probably means it has
a flaw in the implementation...which is exactly what happened with
APR.)

> On a more general note: We don't use hashes as a means to
> randomize our data. For us, they are simply containers with
> an average O(1) insertion and lookup behavior. The APR interface
> also allows for iterating that container - so it *has* an ordering
> and it has been quite stable under different operations and
> over many versions of APR.

Iterating isn't the same as ordering.

> The change in 1.4.6 did *not* solve the fundamental performance
> problem but it makes our life harder - at least for a while.
> If we want a reproducible UI behavior, we must now eliminate
> the use of hashes in all relevant output functions and replace
> them with e.g. sorted arrays. That may take some time.

Sure.

>> And, the third point doesn't make any sense to me without a further
>> explanation.  -- justin
>>
> When we e.g. do an "svn ls -v" (TSVN repo browser), we will
> create and fill the revprop hash for the respective revision
> multiple times for each entry in the directory - just to produce
> a few bytes of output. The hash function showed up in profiles
> with 10% of the total runtime.

Sorry - I'm not following how having a sorted hash table achieves
this; it sounds like reuse of a data structure (not requiring memory
allocs or something) - not that it is guaranteed to be stable on every
call.

> So, I tuned that. Because apr_hash_t is so popular in our code,
> that very localized improvement will give (small) returns in
> improved performance all over the place.

Yes, but there may very well be penalties in other places that you
haven't looked at yet. I know that I've seen shift+incr appear in
traces before - switching to a much more expensive hash algorithm is
going to make those cases worse. -- justin
Received on 2012-05-23 02:30:58 CEST

This is an archived mail posted to the Subversion Dev mailing list.